Preface



DISC, the International Symposium on Distributed Computing, is an annual 
forum for research presentations on all facets of distributed computing. DISC 
2001 was held on Oct 3-5, 2001, in Lisbon, Portugal. This volume includes 23 
contributed papers. It is expected that these papers will be submitted in more 
polished form to fully refereed scientific journals. The extended abstracts of this 
year’s invited lectures, by Gerard LeLann and David Peleg, will appear in next 
year’s proceedings. 

We received 70 regular submissions. These submissions were read and evaluated
by the program committee, with the help of external reviewers when needed.
Overall, the quality of the submissions was excellent, and we were unable to
accept many deserving papers.

This year’s Best Student Paper award goes to Yong-Jik Kim for the paper 
“A Time Complexity Bound for Adaptive Mutual Exclusion” by Yong-Jik Kim 
and James H. Anderson. 
A Time Complexity Bound for Adaptive Mutual Exclusion*

(Extended Abstract) 

Yong-Jik Kim and James H. Anderson 

Department of Computer Science 
University of North Carolina at Chapel Hill 



Abstract. We consider the time complexity of adaptive mutual exclusion
algorithms, where "time" is measured by counting the number of
remote memory references required per critical-section access. We establish
a lower bound that precludes a deterministic algorithm with O(log k)
time complexity (in fact, any deterministic o(k) algorithm), where k is
"point contention." In contrast, we show that expected O(log k) time is
possible using randomization.



1 Introduction 



In this paper, we consider the time complexity of adaptive mutual exclusion
algorithms. A mutual exclusion algorithm is adaptive if its time complexity is a
function of the number of contending processes [?]. Under the time complexity
measure considered in this paper, only remote memory references that
cause a traversal of the global processor-to-memory interconnect are counted.
Specifically, we count the number of such references generated by one process p
in a computation that starts when p becomes active (leaves its noncritical
section) and ends when p once again becomes inactive (returns to its noncritical
section). Unless stated otherwise, we let k denote the "point contention" over
such a computation (the point contention over a computation H is the maximum
number of processes that are active at the same state in H [?]). Throughout this
paper, we let N denote the number of processes in the system.

In recent work, we presented an adaptive mutual exclusion algorithm,
henceforth called Algorithm AK, with O(min(k, log N)) time complexity
[?]. Algorithm AK requires only read/write atomicity and is the only such
algorithm known to us that is adaptive under the remote-memory-references
time complexity measure. In other recent work, we established a worst-case
time bound of Ω(log N / log log N) for mutual exclusion algorithms (adaptive or
not) based on reads, writes, or comparison primitives such as test-and-set and
compare-and-swap [?]. (A comparison primitive conditionally updates a shared
variable after first testing that its value meets some condition.) This result shows
that the O(log N) worst-case time complexity of Algorithm AK is close to
optimal. In fact, we believe it is optimal: we conjecture that Ω(log N) is a tight
lower bound for this class of algorithms.

If Ω(log N) is a tight lower bound, then presumably a lower bound of Ω(log k)
would follow as well. This suggests two interesting possibilities: in all likelihood,
either Ω(min(k, log N)) is a tight lower bound (i.e., Algorithm AK is optimal),
or it is possible to design an adaptive algorithm with O(log k) time complexity
(i.e., Ω(log k) is tight). Indeed, the problem of designing an O(log k) algorithm
using only reads and writes has been mentioned in two recent papers [?].

In this paper, we show that an O(log k) algorithm in fact does not exist. In
particular, we prove the following: For any k, there exists some N such that, for
any N-process mutual exclusion algorithm based on reads, writes, or comparison
primitives, a computation exists involving Θ(k) processes in which some process
performs Ω(k) remote memory references to enter and exit its critical section.

Although this result precludes a deterministic O(log k) algorithm (in fact,
any deterministic o(k) algorithm), we show that a randomized algorithm does
exist with expected O(log k) time complexity. This algorithm is obtained through
a simple modification to Algorithm AK.

The rest of the paper is organized as follows. In Sec. 2, our system model is
defined. Our lower bound proof is presented in Secs. 3-4. The randomized
algorithm mentioned above is sketched in Sec. 5. We conclude in Sec. ??.



2 Definitions 

Our model of a shared-memory system is based on that used in [?].

Shared-memory systems. A shared-memory system S = (C, P, V) consists of a set
of computations C, a set of processes P, and a set of variables V. A computation
is a finite sequence of events.

An event e is denoted [R, W, p], where p ∈ P. The sets R and W consist of
pairs (v, a), where v ∈ V. This notation represents an event of process p that
reads the value a from variable v for each element (v, a) ∈ R, and writes the
value a to variable v for each element (v, a) ∈ W. Each variable in R (or W)
is assumed to be distinct. We define Rvar(e), the set of variables read by e,
to be {v | (v, a) ∈ R}, and Wvar(e), the set of variables written by e, to be
{v | (v, a) ∈ W}. We also define var(e), the set of all variables accessed by e, to
be Rvar(e) ∪ Wvar(e). We say that this event accesses each variable in var(e),
and that process p is the owner of e, denoted owner(e) = p. For brevity, we
sometimes use e_p to denote an event owned by process p.

Each variable is local to at most one process and is remote to all other 
processes. (Note that we allow variables that are remote to all processes.) An 
initial value is associated with each variable. An event is local if it does not 
access any remote variable, and is remote otherwise. 
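
The event notation above can be made concrete with a small sketch. The class name Event, the helper is_remote, and the dictionary local_to (mapping each variable to the process it is local to) are illustrative assumptions and do not appear in the paper; a variable absent from local_to is treated as remote to all processes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An event [R, W, p]; R and W are sets of (variable, value) pairs."""
    reads: frozenset   # pairs (v, a): the event reads value a from v
    writes: frozenset  # pairs (v, a): the event writes value a to v
    owner: str         # the process p that owns the event

def rvar(e):
    """Rvar(e): the set of variables read by e."""
    return {v for (v, _) in e.reads}

def wvar(e):
    """Wvar(e): the set of variables written by e."""
    return {v for (v, _) in e.writes}

def var(e):
    """var(e) = Rvar(e) U Wvar(e)."""
    return rvar(e) | wvar(e)

def is_remote(e, local_to):
    """An event is remote iff it accesses some variable not local to its owner."""
    return any(local_to.get(v) != e.owner for v in var(e))

# Hypothetical example: x is local to p, y is local to q.
local_to = {"x": "p", "y": "q"}
e1 = Event(frozenset({("x", 0)}), frozenset({("x", 1)}), "p")  # local event of p
e2 = Event(frozenset({("y", 0)}), frozenset(), "p")            # remote read by p
```

Here e1 accesses only p's own variable and is therefore local, whereas e2 reads a variable that is local to q and is therefore remote.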

We use ⟨e, ...⟩ to denote a computation that begins with the event e, and
⟨⟩ to denote the empty computation. We use H ∘ G to denote the computation
obtained by concatenating computations H and G. The value of variable v at the
end of computation H, denoted value(v, H), is the last value written to v in H
(or the initial value of v if v is not written in H). The last event to write to v in H
is denoted writer_event(v, H), and its owner is denoted writer(v, H). (Although
our definition of an event allows multiple instances of the same event, we assume
that such instances are distinguishable from each other.) If v is not written by
any event in H, then we let writer(v, H) = ⊥ and writer_event(v, H) = ⊥.
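
As a minimal sketch of these definitions, a computation can be modeled as a Python list of events, each a triple (reads, writes, owner) whose first two components are dictionaries from variables to values. This encoding, the constant BOTTOM (standing in for ⊥), and the explicit initial map are assumptions made for illustration only.

```python
BOTTOM = None  # stands in for the symbol used when v is never written

def writer_event(v, H):
    """The last event in H that writes v, or BOTTOM if there is none."""
    for e in reversed(H):
        if v in e[1]:          # e[1] is the write map of the event
            return e
    return BOTTOM

def writer(v, H):
    """The owner of writer_event(v, H), or BOTTOM."""
    e = writer_event(v, H)
    return BOTTOM if e is BOTTOM else e[2]

def value(v, H, initial):
    """The last value written to v in H, or the initial value of v."""
    e = writer_event(v, H)
    return initial[v] if e is BOTTOM else e[1][v]

# Hypothetical computation: p writes x, q reads x and writes y, q writes x.
H = [({}, {"x": 1}, "p"), ({"x": 1}, {"y": 5}, "q"), ({}, {"x": 2}, "q")]
initial = {"x": 0, "y": 0, "z": 7}
```

In this example the last write to x is by q, so value("x", H, initial) is 2 and writer("x", H) is "q", while z keeps its initial value.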

For a computation H and a set of processes Y, H | Y denotes the subcomputation
of H that contains all events in H of processes in Y. Computations
H and G are equivalent with respect to Y iff H | Y = G | Y. A computation H
is a Y-computation iff H = H | Y. For simplicity, we abbreviate the preceding
definitions when applied to a singleton set of processes. For example, if Y = {p},
then we use H | p to mean H | {p} and p-computation to mean {p}-computation.
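
The projection operator can be sketched as a list filter. Events are modeled here as (name, owner) pairs; this encoding and the function names are assumptions chosen for brevity, not notation from the paper.

```python
def project(H, Y):
    """H | Y: the events of H owned by processes in Y, in their original order."""
    return [e for e in H if e[1] in Y]

def equivalent(H, G, Y):
    """H and G are equivalent with respect to Y iff H | Y = G | Y."""
    return project(H, Y) == project(G, Y)

def is_Y_computation(H, Y):
    """H is a Y-computation iff H = H | Y."""
    return project(H, Y) == H

# Hypothetical computations over processes p, q, r.
H = [("a", "p"), ("b", "q"), ("c", "p"), ("d", "r")]
G = [("a", "p"), ("c", "p"), ("x", "q")]
```

H and G contain the same events of p in the same order, so they are equivalent with respect to {p} even though they differ elsewhere.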

The following properties apply to any shared-memory system. 

(P1) If H ∈ C and G is a prefix of H, then G ∈ C.

(P2) If H ∘ ⟨e_p⟩ ∈ C, G ∈ C, G | p = H | p, and if value(v, G) = value(v, H)
holds for all v ∈ Rvar(e_p), then G ∘ ⟨e_p⟩ ∈ C.

(P3) If H ∘ ⟨e_p⟩ ∈ C, G ∈ C, G | p = H | p, then G ∘ ⟨e'_p⟩ ∈ C for some event e'_p
such that Rvar(e'_p) = Rvar(e_p) and Wvar(e'_p) = Wvar(e_p).

(P4) For any H ∈ C, H ∘ ⟨e_p⟩ ∈ C implies that a = value(v, H) holds, for all
(v, a) ∈ R, where e_p = [R, W, p].

For notational simplicity, we make the following assumption, which requires 
each remote event to be either an atomic read or an atomic write. 

Atomicity Assumption: Each event of a process p may either read or write 
(but not both) at most one variable that is remote to p. □ 

As explained later, this assumption actually can be relaxed to allow
comparison primitives.

Mutual exclusion systems. We now define a special kind of shared-memory
system, namely mutual exclusion systems, which are our main interest.

A mutual exclusion system S = (C, P, V) is a shared-memory system that
satisfies the following properties. Each process p ∈ P has a local variable stat_p
ranging over {ncs, entry, exit} and initially ncs. stat_p is accessed only by the
events Enter_p = [{}, {(stat_p, entry)}, p], CS_p = [{}, {(stat_p, exit)}, p], and
Exit_p = [{}, {(stat_p, ncs)}, p], and is updated only as follows: for all H ∈ C,

H ∘ ⟨Enter_p⟩ ∈ C iff value(stat_p, H) = ncs;

H ∘ ⟨CS_p⟩ ∈ C only if value(stat_p, H) = entry;

H ∘ ⟨Exit_p⟩ ∈ C only if value(stat_p, H) = exit.

(Note that stat_p transits directly from entry to exit.)
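
The update rules for stat_p amount to a three-state cycle, which the following sketch replays for a single process. It checks only the necessary conditions listed above (it does not model when CS_p or Exit_p is actually enabled), and the names TRANSITIONS and stat_after are illustrative assumptions.

```python
# Each transition event maps a required value of stat_p to its new value.
TRANSITIONS = {
    "Enter": ("ncs", "entry"),
    "CS": ("entry", "exit"),
    "Exit": ("exit", "ncs"),
}

def stat_after(labels):
    """Replay the transition events of one process, starting from ncs.

    Raises ValueError if an event is appended when stat_p does not have
    the value that the rules above require.
    """
    stat = "ncs"
    for label in labels:
        required, result = TRANSITIONS[label]
        if stat != required:
            raise ValueError(f"{label} is not allowed when stat = {stat}")
        stat = result
    return stat
```

For instance, stat_after(["Enter", "CS"]) yields "exit", reflecting that stat_p moves directly from entry to exit, while stat_after(["CS"]) is rejected because the process has not entered.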

In our proof, we only consider computations in which each process enters
and then exits its critical section at most once. Thus, we henceforth assume that
each computation contains at most one Enter_p event for each process p. The
remaining requirements of a mutual exclusion system are as follows.

Exclusion: For all H ∈ C, if both H ∘ ⟨CS_p⟩ ∈ C and H ∘ ⟨CS_q⟩ ∈ C hold,
then p = q.

Progress (starvation freedom): For all H ∈ C, if value(stat_p, H) ≠ ncs, then
there exists an X-computation G such that H ∘ G ∘ ⟨e_p⟩ ∈ C, where X = {q ∈
P | value(stat_q, H) ≠ ncs} and e_p is either CS_p (if value(stat_p, H) = entry)
or Exit_p (if value(stat_p, H) = exit). □

Cache-coherent systems. On cache-coherent shared-memory systems, some
remote variable accesses may be handled without causing interconnect traffic. Our
lower-bound proof applies to such systems without modification. This is because
we do not count every remote event, but only critical events, as defined below.

Definition 1. Let S = (C, P, V) be a mutual exclusion system. Let e_p be an
event in H ∈ C. Then, we can write H as F ∘ ⟨e_p⟩ ∘ G, where F and G are
subcomputations of H. We say that e_p is a critical event in H iff one of the
following conditions holds:

State transition event: e_p is one of Enter_p, CS_p, or Exit_p.

Critical read: There exists a variable v, remote to p, such that v ∈ Rvar(e_p)
and F | p does not contain a read from v.

Critical write: There exists a variable v, remote to p, such that v ∈ Wvar(e_p)
and writer(v, F) ≠ p. □

Note that state transition events do not actually cause cache misses; these 
events are defined as critical events because this allows us to combine certain 
cases in the proofs that follow. A process executes only three transition events 
per critical-section execution, so this does not affect our asymptotic lower bound. 

According to Definition 1, a remote read of v by p is critical if it is the first
read of v by p. A remote write of v by p is critical if (i) it is the first write of v by
p (which implies that either writer(v, F) = q ≠ p holds or writer(v, F) = ⊥ ≠ p
holds); or (ii) some other process has written v since p's last write of v (which
also implies that writer(v, F) ≠ p holds).
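
The three conditions of Definition 1 can be checked mechanically on a concrete computation. In the sketch below an event is a tuple (reads, writes, owner, label), where reads and writes are sets of variable names and label is "Enter", "CS", "Exit", or None; this encoding and the map local_to are assumptions made for illustration.

```python
def is_critical(H, i, local_to):
    """Return True iff the event H[i] is critical in H (Definition 1)."""
    reads, writes, p, label = H[i]
    if label in ("Enter", "CS", "Exit"):        # state transition event
        return True
    F = H[:i]                                    # prefix preceding the event
    for v in reads:
        remote = local_to.get(v) != p
        read_before = any(v in r and q == p for (r, _, q, _) in F)
        if remote and not read_before:           # critical read
            return True
    for v in writes:
        remote = local_to.get(v) != p
        last_writer = None                       # writer(v, F)
        for (_, w, q, _) in F:
            if v in w:
                last_writer = q
        if remote and last_writer != p:          # critical write
            return True
    return False

# Hypothetical computation on a variable v that is remote to all processes.
H = [
    (set(), {"v"}, "p", None),   # first write of v by p: critical
    (set(), {"v"}, "p", None),   # p is still the last writer: not critical
    (set(), {"v"}, "q", None),   # first write of v by q: critical
    (set(), {"v"}, "p", None),   # q has written v since p's last write: critical
    ({"v"}, set(), "p", None),   # first read of v by p: critical
    ({"v"}, set(), "p", None),   # repeated read: not critical
]
flags = [is_critical(H, i, {}) for i in range(len(H))]
```

The flags computed for this example are True, False, True, True, True, False, matching cases (i) and (ii) described above.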

Note that if p both reads and writes v, then both its first read of v and first 
write of v are considered critical. Depending on the system implementation, the 
latter of these two events might not generate a cache miss. However, even in such 
a case, the first such event will always generate a cache miss, and hence at least 
half of all such critical reads and writes will actually incur real global traffic. 
Hence, our lower bound remains asymptotically unchanged for such systems. 

In a write-through cache scheme, writes always generate a cache miss. With
a write-back scheme, a remote write to a variable v may create a cached copy of
v, so that subsequent writes to v do not cause cache misses. In Definition 1, if e_p
is not the first write to v by p, then it is considered critical only if writer(v, F) =
q ≠ p holds, which implies that v is stored in the local cache line of another
process q. (Effectively, we are assuming an idealized cache of infinite size: a
cached variable may be updated or invalidated but it is never replaced by another
variable. Note that writer(v, F) = q implies that q's cached copy of v has not
been invalidated.) In such a case, e_p must either invalidate or update the cached
copy of v (depending on the system), thereby generating global traffic.

Note that the definition of a critical event depends on the particular
computation that contains the event, specifically the prefix of the computation
preceding the event. Therefore, when saying that an event is (or is not) critical, the
computation containing the event must be specified.



3 Proof Strategy 

In Sec. 4, we show that for any positive k, there exists some N such that, for any
mutual exclusion system S = (C, P, V) with |P| > N, there exists a computation
H such that some process p experiences point contention k and executes at
least k critical events to enter and exit its critical section. The proof focuses
on a special class of computations called "regular" computations. A regular
computation consists of events of two groups of processes, "active processes"
and "finished processes." Informally, an active process is a process in its entry
section, competing with other active processes; a finished process is a process
that has executed its critical section once, and is in its noncritical section. (These
properties follow from (R4), given later in this section.)

Definition 2. Let S = (C, P, V) be a mutual exclusion system, and H be a
computation in C. We define Act(H), the set of active processes in H, and
Fin(H), the set of finished processes in H, as follows.

Act(H) = {p ∈ P | H | p ≠ ⟨⟩ and Exit_p is not in H}

Fin(H) = {p ∈ P | H | p ≠ ⟨⟩ and Exit_p is in H} □
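
A minimal sketch of Definition 2 follows, with each event modeled as an (owner, label) pair and the label "Exit" marking an Exit event of its owner. The encoding is an assumption made for illustration.

```python
def fin(H):
    """Fin(H): processes that participate in H and whose Exit event is in H."""
    return {p for (p, label) in H if label == "Exit"}

def act(H):
    """Act(H): processes that participate in H and have not executed Exit."""
    participants = {p for (p, _) in H}
    return participants - fin(H)

# Hypothetical computation: p completes its critical section, q is still competing.
H = [("p", "Enter"), ("q", "Enter"), ("p", "CS"), ("p", "Exit"), ("q", "read x")]
```

Here act(H) is {"q"} and fin(H) is {"p"}; a process with no events in H belongs to neither set.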

The proof proceeds by inductively constructing longer and longer regular
computations, until the desired lower bound is attained. The regularity
condition defined below ensures that no participating process has knowledge of any
other process that is active. This has two consequences: (i) we can "erase" any
active process (i.e., remove its events from the computation) and still get a valid
computation; (ii) "most" active processes have a "next" critical event. In the
definition that follows, (R1) ensures that active processes have no knowledge
of each other; (R2) and (R3) bound the number of possible conflicts caused by
appending a critical event; (R4) ensures that the active and finished processes
behave as explained above; (R5) ensures that the property of being a critical
write is conserved when considering certain related computations.

Definition 3. Let S = (C, P, V) be a mutual exclusion system, and H be a
computation in C. We say that H is regular iff the following conditions hold.

(R1) For any events e_p and f_q in H, where p ≠ q, if p writes to a variable v,
and if another process q reads that value from v, then p ∈ Fin(H).

(R2) If a process p accesses a variable that is local to another process q, then
q ∉ Act(H).

(R3) For any variable v, if v is accessed by more than one process in Act(H),
then either writer(v, H) = ⊥ or writer(v, H) ∈ Fin(H) holds.

(R4) For any process p that participates in H (H | p ≠ ⟨⟩), value(stat_p, H) is
entry, if p ∈ Act(H), and ncs otherwise (i.e., p ∈ Fin(H)). Moreover, if
p ∈ Fin(H), then the last event of p in H is Exit_p.

(R5) Consider two events e_p and f_p such that e_p precedes f_p in H, both e_p and
f_p write to a variable v, and f_p is a critical write to v in H. In this case,
there exists a write to v by some process r in Fin(H) between e_p and f_p. □

Proof overview. Initially, we start with a regular computation H_1, where Act(H_1)
= P, Fin(H_1) = {}, and each process has exactly one critical event. We then
inductively show that other longer computations exist, the last of which establishes
our lower bound. Each computation is obtained by rolling some process forward
to its noncritical section (NCS) or by erasing some processes; this basic proof
strategy has been used previously to prove several other lower bounds for
concurrent systems [?]. We assume that P is large enough to ensure that
enough non-erased processes remain after each induction step for the next step
to be applied. The precise bound on |P| is given in Theorem 2.

At the induction step, we consider a computation H_j such that Act(H_j)
consists of n processes that execute j critical events each. We construct a regular
computation H_{j+1} such that Act(H_{j+1}) consists of Ω(√n/k) processes that
execute j + 1 critical events each. The construction method, formally described
in Lemma 4, is explained below. In constructing H_{j+1} from H_j, some processes
may be erased and at most one rolled forward. At the end of step k - 1, we have
a regular computation H_k in which each active process executes k critical events
and |Fin(H_k)| ≤ k - 1. Since active processes have no knowledge of each other, a
computation involving at most k processes can be obtained from H_k by erasing
all but one active process; the remaining process performs k critical events.

We now describe how H_{j+1} is constructed from H_j. We show in Lemma 3
that, among the n processes in Act(H_j), at least n - 1 can each execute an additional
critical event prior to its critical section. We call these events "future" critical
events, and denote the corresponding set of processes by Y. We consider two
cases, based on the variables remotely accessed by these future critical events.

Erasing strategy. Assume that Ω(√n) distinct variables are remotely accessed
by the future critical events. For each such variable v, we select one process whose
future critical event accesses v, and erase the rest. Let Y' be the set of selected
processes. We now eliminate any information flow among processes in Y' by
constructing a "conflict graph" 𝒢 as follows.

Each process p in Y' is considered a vertex in 𝒢. By induction, process p has
j critical events in H_j and one future critical event. An edge (p, q), where
p ≠ q, is included in 𝒢 (i) if the future critical event of p remotely accesses a
local variable of process q, or (ii) if one of p's j + 1 critical events accesses the
same variable as the future critical event of process q.

Since each process in Y' accesses a distinct remote variable in its future
critical event, it is clear that each process generates at most one edge by rule (i)
and at most j + 1 edges by rule (ii). By applying Turán's theorem (Theorem 1),
we can find a subset Z of Y' such that |Z| = Ω(√n/j) and their critical events
do not conflict with each other. By retaining Z and erasing all other active
processes, we can eliminate all conflicts. Thus, we can construct H_{j+1}.

Roll-forward strategy. Assume that the number of distinct variables that are
remotely accessed by the future critical events is O(√n). Since there are Θ(n)
future critical events, there exists a variable v that is remotely accessed by future
critical events of Ω(√n) processes. Let Y_v be the set of these processes. First, we
retain Y_v and erase all other active processes. Let the resulting computation be
H'. We then arrange the future critical events of Y_v by placing all writes before
all reads. In this way, the only information flow among processes in Y_v is that
from the "last writer" of v to all the subsequent readers (of v). Let p_LW be the
last writer. We then roll p_LW forward by generating a regular computation G
from H' such that Fin(G) = Fin(H') ∪ {p_LW}.

If p_LW executes at least k critical events before reaching its NCS, then the
Ω(k) lower bound easily follows. Therefore, we can assume that p_LW performs
fewer than k critical events while being rolled forward. Each critical event of p_LW
that is appended to H' may generate information flow only if it reads a variable
v that is written by another process in H'. Condition (R3) guarantees that if
there are multiple processes that write to v, the last writer in H' is not active.
Because information flow from an inactive process is allowed, a conflict arises
only if there is a single process that writes to v in H'. Thus, each critical event
of p_LW conflicts with at most one process in Y_v, and hence can erase at most
one process. (Appending a noncritical event to H' cannot cause any processes to
be erased. In particular, if a noncritical remote read by p_LW is appended, then
p_LW must have previously read the same variable. By (R3), if the last writer is
another process, then that process is not active.)

Therefore, the entire roll-forward procedure erases fewer than k processes
from Act(H') = Y_v. We can assume |P| is sufficiently large to ensure that √n >
2k. This ensures that Ω(√n) processes survive after the entire procedure. Thus,
we can construct H_{j+1}.
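
To get a feel for how quickly the set of active processes shrinks, the following sketch iterates the per-step bound min(√n/(2c + 3), √n - k) that Lemma 4 states for the induction step, for c = 1, ..., k - 1. It is a rough real-valued illustration only: it ignores rounding, assumes the worst case at every step, and the function name surviving_bound is hypothetical.

```python
import math

def surviving_bound(n, k):
    """Lower bound on active processes left after steps c = 1, ..., k - 1.

    Starts from n active processes with one critical event each and applies
    n -> min(sqrt(n) / (2c + 3), sqrt(n) - k) once per step. Returns 0.0 as
    soon as the bound no longer guarantees a surviving process.
    """
    for c in range(1, k):
        if n < 1:
            return 0.0
        root = math.sqrt(n)
        n = min(root / (2 * c + 3), root - k)
    return max(n, 0.0)
```

With k = 3, starting from 10**12 processes leaves a bound of roughly 63 active processes after two steps, whereas starting from 100 processes leaves none, which illustrates why N must grow very rapidly with k.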

4 Lower Bound for Systems with Read/Write Atomicity



In this section, we present our lower-bound theorem for systems satisfying the 
Atomicity Assumption. At the end of this section, we explain why the lower 
bound also holds for systems with comparison primitives. We begin by stating 
several lemmas. Lemma 1 states that we can safely “erase” any active process. 
Lemma 2 allows us to extend a computation by noncritical events. Lemma 3 is 
used to show that “most” active processes have a “next” critical event. 

Lemma 1. Consider a regular computation H in C. For any set Y of
processes such that Fin(H) ⊆ Y, the following hold: H | Y ∈ C, H | Y is regular,
Fin(H | Y) = Fin(H), and an event e in H | Y is a critical event iff it is also a
critical event in H. □

Lemma 2. Consider a regular computation H in C, and a set of processes Y =
{p_1, p_2, ..., p_m}, where Y ⊆ Act(H). Assume that for each p_j in Y, there exists
a p_j-computation L_{p_j}, such that H ∘ L_{p_j} ∈ C and L_{p_j} has no critical events in
H ∘ L_{p_j}. Define L to be L_{p_1} ∘ L_{p_2} ∘ ... ∘ L_{p_m}. Then, the following hold: H ∘ L ∈ C,
H ∘ L is regular, Fin(H ∘ L) = Fin(H), and L has no critical events in H ∘ L. □

Lemma 3. Let H be a regular computation in C. Define n = |Act(H)|. Then,
there exists a subset Y of Act(H), where n - 1 ≤ |Y| ≤ n, satisfying the following:
for each process p in Y, there exist a p-computation L_p and an event e_p of p
such that

• H ∘ L_p ∘ ⟨e_p⟩ ∈ C;

• L_p contains no critical events in H ∘ L_p;

• e_p ∉ {Enter_p, CS_p, Exit_p};

• e_p is a critical event of p in H ∘ L_p ∘ ⟨e_p⟩;

• H ∘ L_p is regular;

• Fin(H ∘ L_p) = Fin(H). □

The next theorem by Turán [?] will be used in proving Lemma 4.

Theorem 1 (Turán). Let 𝒢 = (V, E) be an undirected graph, where V is a set
of vertices and E is a set of edges. If the average degree of 𝒢 is d, then there
exists an independent set¹ with at least ⌈|V|/(d + 1)⌉ vertices. □
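
Theorem 1 is an existence statement, but an independent set of the guaranteed size can also be found constructively. The sketch below uses the standard minimum-degree greedy heuristic, which is known to reach the ⌈|V|/(d + 1)⌉ bound; it is offered as an illustration and is not the argument used in the paper. The function names are assumptions, and vertices are assumed to be mutually comparable so that ties can be broken deterministically.

```python
import math

def greedy_independent_set(vertices, edges):
    """Repeatedly pick a minimum-degree vertex and delete it with its neighbours."""
    adj = {v: set() for v in vertices}
    for (u, w) in edges:
        if u != w:
            adj[u].add(w)
            adj[w].add(u)
    chosen = []
    while adj:
        # Minimum remaining degree; ties broken by vertex order.
        v = min(adj, key=lambda u: (len(adj[u]), u))
        chosen.append(v)
        removed = adj[v] | {v}
        for u in removed:
            del adj[u]
        for u in adj:
            adj[u] -= removed
    return chosen

def turan_bound(num_vertices, num_edges):
    """ceil(|V| / (d + 1)), where d = 2|E| / |V| is the average degree."""
    d = 2 * num_edges / num_vertices
    return math.ceil(num_vertices / (d + 1))

# Hypothetical example: a cycle on six vertices (average degree 2).
vertices = list(range(6))
edges = [(i, (i + 1) % 6) for i in range(6)]
independent = greedy_independent_set(vertices, edges)
```

On the six-cycle the bound is ⌈6/3⌉ = 2, and the greedy procedure returns a set of pairwise non-adjacent vertices that is at least that large.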

The following lemma provides the induction step of our lower-bound proof. 

Lemma 4. Let S = (C, P, V) be a mutual exclusion system, k be a positive
integer, and H be a regular computation in C. Define n = |Act(H)|. Assume
that n > 1 and

• each process in Act(H) executes exactly c critical events in H. (1)

Then, one of the following propositions is true.

(Pr1) There exist a process p ∈ Act(H) and a computation F ∈ C such that

• F ∘ ⟨Exit_p⟩ ∈ C;

• F does not contain Exit_p;

• at most m processes participate in F, where m = |Fin(H)| + 1;

• p executes at least k critical events in F.

(Pr2) There exists a regular computation G in C such that

• Act(G) ⊆ Act(H);

• |Fin(G)| ≤ |Fin(H)| + 1;

• |Act(G)| ≥ min(√n/(2c + 3), √n - k); (2)

• each process in Act(G) executes exactly c + 1 critical events in G.

Proof. Because H is regular, using Lemma 3, we can construct a subset Y of
Act(H) such that

n - 1 ≤ |Y| ≤ n, (3)

and for each p ∈ Y, there exist a p-computation L_p and an event e_p such that

• H ∘ L_p ∘ ⟨e_p⟩ ∈ C; (4)

• L_p contains no critical events in H ∘ L_p; (5)

• e_p ∉ {Enter_p, CS_p, Exit_p}; (6)

• e_p is a critical event of p in H ∘ L_p ∘ ⟨e_p⟩; (7)

• H ∘ L_p is regular; (8)

• Fin(H ∘ L_p) = Fin(H). (9)

¹ An independent set of a graph 𝒢 = (V, E) is a subset V' ⊆ V such that no edge in
E is incident to two vertices in V'.

Define V_fut as the set of variables remotely accessed by the "future" critical
events:

V_fut = {v ∈ V | there exists p ∈ Y such that e_p remotely accesses v}. (10)

We consider two cases, depending on the size of V_fut.

Case 1: |V_fut| ≥ √n (erasing strategy) By definition, for each variable v in V_fut,
there exists a process p in Y such that e_p remotely accesses v. Therefore, we
can arbitrarily select one such process for each variable v in V_fut and construct
a subset Y' of Y such that

• if p ∈ Y', q ∈ Y' and p ≠ q, then e_p and e_q access different remote variables,
and (11)

• |Y'| = |V_fut| ≥ √n. (12)

We now construct a graph 𝒢 = (Y', E_𝒢), where each vertex is a process in
Y'. To each process y in Y', we apply the following rules.

(G1) Let v ∈ V_fut be the variable remotely accessed by e_y. If v is local to a
process z in Y', then introduce edge (y, z).

(G2) For each critical event f of y in H, let v_f be the variable remotely accessed
by f. If v_f ∈ V_fut and v_f is remotely accessed by event e_z for some process
z ≠ y in Y', then introduce edge (y, z).

Because each variable is local to at most one process, and since an event can
access at most one remote variable, (G1) can introduce at most one edge per
process. Since, by (1), y executes exactly c critical events in H, by (11), (G2)
can introduce at most c edges per process.

Combining (G1) and (G2), at most c + 1 edges are introduced per process.
Thus, the average degree of 𝒢 is at most 2(c + 1). Hence, by Theorem 1, there
exists an independent set Z ⊆ Y' such that

• |Z| ≥ |Y'|/(2c + 3) ≥ √n/(2c + 3), (13)

where the latter inequality follows from (12).

Next, we construct a computation G, satisfying (Pr2), such that Act(G) =
Z. Let H' = H | (Z ∪ Fin(H)). For each process z ∈ Z, (4) implies H ∘ L_z ∈ C.
Hence, by (8) and (9), and applying Lemma 1 with 'H' ← H ∘ L_z and 'Y' ←
Z ∪ Fin(H), we have the following:

• H' ∘ L_z ∈ C (which, by (P1), implies H' ∈ C), and

• an event in H' ∘ L_z is critical iff it is also critical in H ∘ L_z. (14)

By (5), the latter also implies that L_z contains no critical events in H' ∘ L_z.
Let m = |Z| and index the processes in Z as Z = {z_1, z_2, ..., z_m}. Define
L = L_{z_1} ∘ L_{z_2} ∘ ... ∘ L_{z_m}. By applying Lemma 2 with 'H' ← H' and 'Y' ← Z,
we have the following:

• H' ∘ L ∈ C,

• H' ∘ L is regular, and

• L contains no critical events in H' ∘ L. (15)

By the definition of H' and L, we also have

• for each z ∈ Z, (H' ∘ L) | z = (H ∘ L_z) | z. (16)

Therefore, by (4) and Property (P3), for each z_j ∈ Z, there exists an event
e'_{z_j}, such that

• G ∈ C, where G = H' ∘ L ∘ E and E = ⟨e'_{z_1}, e'_{z_2}, ..., e'_{z_m}⟩;

• Rvar(e'_{z_j}) = Rvar(e_{z_j}), Wvar(e'_{z_j}) = Wvar(e_{z_j}), and owner(e'_{z_j}) =
owner(e_{z_j}) = z_j.

Conditions (R1)-(R5) can be individually checked to hold in G, which implies
that G is a regular computation. Since Z ⊆ Y' ⊆ Y ⊆ Act(H), by (1), (14), and
(15), each process in Z executes exactly c critical events in H' ∘ L.

We now show that every event in E is critical in G. Note that, by (7), e_z is
a critical event in H ∘ L_z ∘ ⟨e_z⟩. By (6), e_z is not a transition event. By (16),
the events of z are the same in both H ∘ L_z and H' ∘ L. Thus, if e_z is a critical
read or a "first" critical write in H ∘ L_z ∘ ⟨e_z⟩, then it is also critical in G. The
only remaining case is that e_z writes some variable v remotely, and is critical in
H ∘ L_z ∘ ⟨e_z⟩ because of a write to v prior to e_z by another process not in G.
However, (R5) ensures that in such a case there exists some process in Fin(H)
that writes to v before e_z, and hence e_z is also critical in G.

Thus, we have constructed a computation G that satisfies the following:
Act(G) = Z ⊆ Act(H), Fin(G) = Fin(H') = Fin(H) (from the definition of H',
and since L ∘ E does not contain transition events), |Act(G)| ≥ √n/(2c + 3)
(from (13)), and each process in Act(G) executes exactly c + 1 critical events in
G (from the preceding paragraph). It follows that G satisfies (Pr2).

Case 2: |V_fut| < √n (roll-forward strategy) For each variable v_j in V_fut, define
Y_{v_j} = {p ∈ Y | e_p remotely accesses v_j}. By (3) and (10), |V_fut| < √n implies
that there exists a variable v_j in V_fut such that |Y_{v_j}| ≥ (n - 1)/√n holds. Let v
be one such variable. Then, the following holds:

|Y_v| ≥ (n - 1)/√n > √n - 1. (17)

Define H' = H | (Y_v ∪ Fin(H)). Using Y_v ⊆ Y ⊆ Act(H), we also have

Act(H') = Y_v ⊆ Act(H) ∧ Fin(H') = Fin(H). (18)

Because H is regular, by Lemma 1,

• H' ∈ C, (19)

• H' is regular, and (20)

• an event in H' is a critical event iff it is also a critical event in H. (21)



We index processes in Y_v from y_1 to y_m, where m = |Y_v|, such that if e_{y_i}
writes to v and e_{y_j} reads v, then i < j (i.e., future writes to v precede future
reads from v).
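
The selection of v and the indexing of Y_v can be sketched as a pigeonhole choice followed by a stable sort. The dictionary future, which maps each process in Y to the variable and kind ('write' or 'read') of its future critical event, is a hypothetical encoding introduced only for this illustration.

```python
from collections import defaultdict

def pick_variable_and_order(future):
    """Choose the most frequently accessed variable v and index Y_v.

    Returns (v, order), where order lists the processes whose future event
    accesses v, with all writers placed before all readers.
    """
    groups = defaultdict(list)
    for p, (v, kind) in future.items():
        groups[v].append((p, kind))
    # Pigeonhole: some variable is accessed by at least |Y| / |V_fut| processes.
    v = max(groups, key=lambda u: len(groups[u]))
    # Stable sort: writers (key False) come before readers (key True).
    ordered = sorted(groups[v], key=lambda pk: pk[1] != "write")
    return v, [p for p, _ in ordered]

# Hypothetical future critical events of five processes over two variables.
future = {
    "p1": ("x", "read"),
    "p2": ("x", "write"),
    "p3": ("y", "write"),
    "p4": ("x", "write"),
    "p5": ("x", "read"),
}
v, order = pick_variable_and_order(future)
```

In this example v is "x" and the order is p2, p4, p1, p5, so p4 plays the role of the last writer whose value every subsequent reader observes.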

For each y ∈ Y_v, let F_y = (H ∘ L_y) | (Y_v ∪ Fin(H)). By (4), H ∘ L_y ∈ C.
Hence, by (8) and (9), and applying Lemma 1 with 'H' ← H ∘ L_y and 'Y' ← Y_v ∪ Fin(H),
we have the following: F_y ∈ C, and an event in F_y is critical iff it is also critical
in H ∘ L_y. Since y ∈ Y_v and L_y is a y-computation, by the definition of H',
F_y = H' ∘ L_y. Hence, by (5), we have

• H' ∘ L_y ∈ C, and (22)

• L_y does not have a critical event in H' ∘ L_y. (23)

Define L = L_{y_1} ∘ L_{y_2} ∘ ... ∘ L_{y_m}. We now use Lemma 2, with 'H' ← H' and
'Y' ← Y_v. The antecedent of the lemma follows from (18), (19), (20), (22), and
(23).

This gives us the following.

• H' ∘ L ∈ C,

• H' ∘ L is regular, (24)

• Fin(H' ∘ L) = Fin(H), and (25)

• L contains no critical events in H' ∘ L. (26)

By the definition of H' and L, we also have

• for each y ∈ Y_v, (H' ∘ L) | y = (H ∘ L_y) | y. (27)

Therefore, by (4) and Property (P3), for each y_j ∈ Y_v, there exists an event
e'_{y_j}, such that

• G ∈ C, where G = H' ∘ L ∘ E and E = ⟨e'_{y_1}, e'_{y_2}, ..., e'_{y_m}⟩;

• Rvar(e'_{y_j}) = Rvar(e_{y_j}), Wvar(e'_{y_j}) = Wvar(e_{y_j}), and owner(e'_{y_j}) =
owner(e_{y_j}) = y_j.

From (5) and (6), it follows that L ∘ E does not contain any transition
events. Moreover, by the definition of L and E, (L ∘ E) | p ≠ ⟨⟩ implies p ∈ Y_v,
for each process p. Combining these assertions with (18), we have

Act(G) = Act(H' ∘ L) = Act(H') = Y_v ∧
Fin(G) = Fin(H' ∘ L) = Fin(H') = Fin(H). (28)

We now claim that each process in Y_v (= Act(G)) executes exactly c + 1
critical events in G. In particular, by (1), (18), (21), and (26), it follows that
each process in Y_v executes exactly c critical events in H' ∘ L. On the other hand,
by (7), e_y is a critical event in H ∘ L_y ∘ ⟨e_y⟩. By (27), and using an argument
that is similar to that at the end of Case 1, we can prove that each event e'_y in
E is a critical event in G.

Let plw be the last process to write to u in G (if such a process exists). If 
Plw does not exist or if ppw S Fin(G) = Fin(ii), conditions (RI)-(R5) can be 
individually checked to hold in G, which implies that G is a regular computation. 
In this case, ED and (ED imply that G satisfies (Pr2). 

Therefore, assume ppw £ Act(G) = Yy. Define iLuw = {H' o L) \ (Fin(iL) U 
{PLw})- By (ED, (I2D, and applying Lemma 1 with H' o L and ^Y’ G- 

Fin(iL') U {plw}, we have: Hlw £ G, Hlw is regular, and Act(iLLw) = {plw}- 

Since p_LW is the only active process in H_LW, by the Progress property, there
exists a p_LW-computation F such that H_LW ∘ F ∘ ⟨Exit_{p_LW}⟩ ∈ C and F contains
exactly one CS_{p_LW} and no Exit_{p_LW}. If F contains k or more critical events in
H_LW ∘ F, then H_LW ∘ F satisfies (Pr1). Therefore, we can assume that F contains
at most k − 1 critical events in H_LW ∘ F. Let V_LW be the set of variables remotely
accessed by these critical events.

If a process q in Yy writes to a variable in V_LW in H', it might generate
information flow between q and p_LW. Therefore, define K, the set of processes
to erase (or “kill”), as AT = {p G — {plw} I P = writer{u, H') or u is local to p 
for some u G 14,w}- (R2) ensures that each variable in Vlw introduces at most 
one process into K. Thus, we have \K\ < |VlwI < k — 1. 

Define S, the “survivors,” as S = Yy — K, and let Hs = (H'oL) \ (Fin(iJ)U5'). 
By O, and applying Lemma 1 with ‘H’ H' o L and ‘Y’ ^ Fin(iL') U S, we 
have the following: Hs G C, Hs is regular, and Act(_ff 5 ) = S. By the definition 
of L, we also have Hs \ y = {H' o L)\y, for each y £ S. Since every event in E 
accesses only local variables and v, by (P2), we have Hs o E £ C . 

Note that (P2) also implies that the first event of F is Hence, we can 
write F as o F' . Define G = Hs o E o F' . Note that {Hs o E) |plw = 
{Hlw I Plw) o (CpLw)’ events of F' cannot read any variable written by 

processes in S other than plw itself. Therefore, by inductively applying (P2) 
over the events of F' , we have G £ C . 

Conditions (R1)-(R5) can be individually checked to hold in G, which implies 
that G is a regular computation such that Fin(G) = Fin(iL) U {plw}- Moreover, 
by and from \K\ < fc— 1, we have | Act(G)| = [S'] > (v^— 1) — (fc— 1) > y/n—k. 
Thus, G satisfies (Pr2). □ 

Theorem 2. Let $N(k) = (2k + 1)^{2(2^k - 1)}$. For any mutual exclusion system
$S = (C, P, V)$ and for any positive number $k$, if $|P| \ge N(k)$, then there exists a
computation $H$ such that at most $k$ processes participate in $H$ and some process
$p$ executes at least $k$ critical operations in $H$ to enter and exit its critical section.

Proof. Let $H_1 = \langle Enter_1, Enter_2, \ldots, Enter_N \rangle$, where $P = \{1, 2, \ldots, N\}$ and
$N \ge N(k)$. By the definition of a mutual exclusion system, $H_1 \in C$. It is obvious
that $H_1$ is regular and each process in $\mathrm{Act}(H_1) = P$ has exactly one critical event
in $H_1$. Starting with $H_1$, we repeatedly apply Lemma 4 and construct a sequence
of computations $H_1, H_2, \ldots$ such that each process in $\mathrm{Act}(H_j)$ has $j$ critical
events in $H_j$. We repeat the process until either $H_k$ is constructed or some $H_j$
satisfies (Pr1) of Lemma 4.

If some $H_j$ ($j < k$) satisfies (Pr1), then consider the first such $j$. By our
choice of $j$, each of $H_1, \ldots, H_{j-1}$ satisfies (Pr2) of Lemma 4. Therefore, since
$|\mathrm{Fin}(H_1)| = 0$, we have $|\mathrm{Fin}(H_j)| \le j - 1 < k$. It follows that computation
$F \circ \langle Exit_p \rangle$, generated by applying Lemma 4 to $H_j$, satisfies Theorem 2.

The remaining possibility is that each of $H_1, \ldots, H_{k-1}$ satisfies (Pr2). We
claim that, for $1 \le j \le k$, the following holds:

$$|\mathrm{Act}(H_j)| \ge (2k + 1)^{2(2^{k+1-j} - 1)}. \qquad (29)$$

The induction basis ($j = 1$) directly follows from $\mathrm{Act}(H_1) = P$ and $|P| \ge N(k)$.
In the induction step, assume that (29) holds for some $j$ ($1 \le j < k$), and let
$n_j = |\mathrm{Act}(H_j)|$. Note that each active process in $H_j$ executes exactly $j$ critical
events. By (29), we also have $n_j \ge 4k^2$, which implies that $\sqrt{n_j} - k \ge \sqrt{n_j}/2 \ge
\sqrt{n_j}/(2k + 1)$. Therefore, by (Pr2), we have $|\mathrm{Act}(H_{j+1})| \ge \min(\sqrt{n_j}/(2j + 3),
\sqrt{n_j} - k) \ge \sqrt{n_j}/(2k + 1)$, from which the induction easily follows.

shared variable
    X: {⊥} ∪ {0..N − 1} initially ⊥;
    Y: boolean initially false

private variable
    dir: {stop, left, right}

1: X := p;
2: if Y then dir := left
   else
3:     Y := true;
4:     if X = p then dir := stop
5:     else dir := right
       fi
   fi

Fig. 1. (a) The splitter element and (b) the code fragment that implements it.



Finally, (29) implies $|\mathrm{Act}(H_k)| \ge 1$, and (Pr2) implies $|\mathrm{Fin}(H_k)| \le k - 1$.
Hence, select any arbitrary process $p$ from $\mathrm{Act}(H_k)$. Define $G = H_k \mid (\mathrm{Fin}(H_k) \cup
\{p\})$. Clearly, at most $k$ processes participate in $G$. By applying Lemma 1 with
‘H’ ← $H_k$ and ‘Y’ ← $\mathrm{Fin}(H_k) \cup \{p\}$, we have the following: $G \in C$, and an event
in $G$ is critical iff it is also critical in $H_k$. Hence, because $p$ executes $k$ critical
events in $H_k$, $G$ is a computation that satisfies Theorem 2. □

Theorem 2 can be easily strengthened to apply to systems in which compari- 
son primitives are allowed. The key idea is this: if several comparison primitives 
on some shared variable are currently enabled, then they can be applied in an or- 
der in which at most one succeeds. A comparison primitive can be treated much 
like an ordinary write if successful, and like an ordinary read if unsuccessful. 

5 Randomized Algorithm 

In this section, we describe the randomized version of Algorithm AK men- 
tioned earlier. Due to space constraints, only a high-level description of Algo- 
rithm AK is included here. A full description can be found in 0. 

At the heart of Algorithm AK is the splitter element from Lamport’s fast
mutual exclusion algorithm ||. The splitter element is defined in Fig. 1. Each
process that invokes a splitter either stops or moves left or right (as indicated
by the value assigned to the variable dir). Splitters are useful because of the
following properties: if n processes invoke a splitter, then at most one of them
can stop at that splitter, and at most n − 1 can move left (respectively, right).
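
The splitter properties stated above can be checked mechanically for small n. The following is a minimal sketch, not part of the original algorithm description: it assumes that each numbered line of the splitter code executes as one atomic step (lines 4 and 5 together form a single read of X), and it explores every interleaving of n processes. The function name `splitter_outcomes` and the state encoding are illustrative choices.

```python
def splitter_outcomes(n):
    """Explore every interleaving of n processes through one splitter.

    Returns the set of final outcome tuples; entry p is the value of
    dir ("stop", "left" or "right") assigned to process p.
    Assumption: each numbered line of the splitter is one atomic step.
    """
    start = (None, False, tuple([1] * n))  # (X, Y, per-process program counter)
    seen, stack, finals = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        x, y, pcs = state
        # A process is still running while its entry is a line number.
        running = [p for p in range(n) if isinstance(pcs[p], int)]
        if not running:
            finals.add(pcs)
            continue
        for p in running:
            nx, ny, pc = x, y, pcs[p]
            if pc == 1:        # 1: X := p
                nx, nxt = p, 2
            elif pc == 2:      # 2: if Y then dir := left
                nxt = "left" if y else 3
            elif pc == 3:      # 3: Y := true
                ny, nxt = True, 4
            else:              # 4-5: if X = p then stop else right
                nxt = "stop" if x == p else "right"
            stack.append((nx, ny, pcs[:p] + (nxt,) + pcs[p + 1:]))
    return finals
```

For example, `splitter_outcomes(2)` enumerates all ways two processes can leave the splitter; in each outcome at most one process stops, and the two never both move left or both move right.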

In Algorithm AK, splitter elements are used to construct a “renaming 
tree.” A splitter is located at each node of the tree and corresponds to a “name.” 
A process acquires a name by moving down through the tree, starting at the 
root, until it stops at some splitter. Within the renaming tree is an arbitration 

Fig. 2. (a) Renaming tree and overflow tree, (b) Process p gets a name in the renaming
tree, (c) Process q fails to get a name and must compete within the overflow tree.

tree that forms dynamically as processes acquire names. A process competes 
within the arbitration tree by moving up to the root, starting at the node where 
it acquired its name. Associated with each node in the tree is a three-process 
mutual exclusion algorithm. As a process moves up the tree, it executes the entry 
section associated with each node it visits. After completing its critical section, 
a process retraces its path, this time executing exit sections. A three-process 
mutual exclusion algorithm is needed at each node to accommodate one process 
from each of the left- and right-subtrees beneath that node and any process that 
may have successfully acquired a name at that node. 

In Algorithm AK, the renaming tree’s height is defined to be $\lfloor \log N \rfloor$,
which results in a tree with $\Theta(N)$ nodes. With a tree of this height, a process
could “fall off” the end of the tree without acquiring a name. To handle such
processes, a second arbitration tree, called the “overflow tree,” is used. The
renaming and overflow trees are connected by placing a two-process mutual
exclusion algorithm on top of each tree, as illustrated in Fig. 2.

The time complexity of Algorithm AK is determined by the depth to which
a process descends in the renaming tree. If the point contention experienced by
a process p is k, then the depth to which p descends is $O(k)$. This is because, of
the processes that access the same splitter, all but one may move in the same
direction from that splitter. If all the required mutual exclusion algorithms are
implemented using Yang and Anderson’s local-spin algorithm d, then because
the renaming and overflow trees are both of height $\Theta(\log N)$, overall time
complexity is $O(\min(k, \log N))$.

Our new randomized algorithm is obtained from Algorithm AK by replacing
the original splitter with a probabilistic version, which is obtained by using
“dir := choice(left, right)” in place of the assignments to dir at lines 2 and 5
in Fig. 1, where choice(left, right) returns left (right) with probability 1/2.
With this change, a process descends to an expected depth of $\Theta(\log k)$ in the
renaming tree. Thus, the algorithm has $\Theta(\log k)$ expected time complexity.

6 Concluding Remarks 

We have established a lower bound that precludes an $O(\log k)$ adaptive mutual
exclusion algorithm (in fact, any $o(k)$ algorithm) based on reads, writes, or
comparison primitives. We have also shown that expected $\Theta(\log k)$ time is possible
using randomization.

One may wonder whether an $\Omega(\min(k, \log N / \log\log N))$ lower bound follows
from the results of this paper and Pj. Unfortunately, the answer is no. We have shown
that $\Omega(k)$ time complexity is required provided $N$ is sufficiently large. This leaves
open the possibility that an algorithm might have $\Theta(k)$ time complexity for very
“low” levels of contention, but $o(k)$ time complexity for “intermediate” levels of
contention. However, we find this highly unlikely.

Abstract. We propose a quorum system, which we refer to as the
surficial quorum system, for group mutual exclusion. The surficial quorum
system is geometrically evident and so is easy to construct. It also
has a nice structure based on which a truly distributed algorithm for
group mutual exclusion can be obtained, and processes’ loads can be
minimized. When used with Maekawa’s algorithm, the surficial quorum
system allows up to $\sqrt{2n/(m(m-1))}$ processes to access a resource
simultaneously, where n is the total number of processes, and m is the total
number of groups. We also present two modifications of Maekawa’s
algorithm so that the number of processes that can access a resource at
a time is not limited to the structure of the underlying quorum system,
but to the number that the problem definition allows.



1 Introduction 

Group mutual exclusion is a generalization of mutual exclusion that 

allows a resource to be shared by processes of the same group, but requires 
processes of different groups to use the resource in a mutually exclusive style. 
As an application of the problem, assume that data is stored in a CD jukebox 
where only one disk can be loaded for access at a time. Then when a disk is 
loaded, users that need data on this disk can concurrently access the disk, while 
users that need a different disk have to wait until no one is using the currently 
loaded disk. Group mutual exclusion differs from $\ell$-exclusion jO] (which is also a
generalization of mutual exclusion that allows at most $\ell$ processes to be in the
critical section simultaneously) in that in the latter the conflict in accessing a 
resource is due to the number of processes that attempt to access the resource, 
while in the former the conflict is due to the “type” of processes (i.e., the group 
they belong to). 

Solutions for group mutual exclusion in shared memory models have been
proposed in [Ihl2()l4lin| . Here we consider message passing networks. Two 
message-passing solutions for group mutual exclusion have been proposed in ini 
Both are extensions of Ricart and Agrawala’s algorithm for mutual exclusion EH- 
Basically, they work as follows: a process wishing to enter a critical section (i.e., 
to use a shared resource) broadcasts a request to all processes in the system, 
and enters the critical section when all processes have acknowledged its request. 
Since all processes are involved in determining whether a process can enter the 
critical section, the algorithms cannot tolerate any single process failure. 

In the literature, quorum systems have proven useful in coping with site 
failures and network partitions for both mutual exclusion (e.g., 

EOl), and ^-exclusion (e.g., In general, a quorum system (called a 

coterie El) consists of a set of quora, each of which is a set of processes. Quora 
are used to guard the critical section. To enter the critical section, a process 
must acquire a quorum; that is, to obtain permission from every member of the 
quorum. Suppose that a quorum member gives permission to only one process at 
a time. Then mutual exclusion can be guaranteed by requiring every two quora 
in a coterie to intersect, and $\ell$-exclusion can be guaranteed by requiring any
collection of $\ell + 1$ quora to contain at least two intersecting quora. A quorum
usually involves only a subset of the processes in the system. So even if processes
may fail or become unreachable due to network partitions, some process may still
be able to enter the critical section so long as not all quora are hit (a quorum is
hit if some of its members fail).

It is easy to see that coteries for $\ell$-exclusion cannot be used for group mutual
exclusion because two conflicting processes may then both enter the critical sec- 
tion after obtaining permissions from members of two disjoint quora respectively. 
On the other hand, coteries for mutual exclusion can be used for group mutual 
exclusion, but it will result in a degenerate solution in which only one process 
can be in the critical section at a time. 



In this paper we present a quorum system, which we refer to as the surficial 
quorum system, for group mutual exclusion. To our knowledge, this is the first 
quorum system for group mutual exclusion to appear in the literature. The 
surficial quorum system is geometrically evident and so is easy to construct. It 
also has a nice structure based on which a truly distributed algorithm for group 
mutual exclusion can be obtained, and processes’ loads can be minimized. When
used with Maekawa’s algorithm m, the surficial quorum system allows up to
$\sqrt{2n/(m(m-1))}$ processes to access a resource simultaneously, where n is the total
number of processes, and m is the total number of groups. In contrast, only one 
process is allowed to access a resource at a time if an ordinary quorum system 
is used. 

We also present a modification of Maekawa’s algorithm so that the number 
of processes that can access a resource at a time is not limited to the structure 
of the underlying quorum system, but to the number that the problem definition 
allows. Thus, the modified algorithm can also use ordinary quorum systems to 
solve group mutual exclusion. Nevertheless, when used with our surficial quorum 
system, the message complexity of the modified algorithm is still a factor of
m ^-i ) ) better than that used with an ordinary quorum system (of the 

same quorum size). Another modification that trades off synchronization delay
for message complexity is also presented in the paper.

The rest of the paper is organized as follows. Section 2 gives the problem
definition for group mutual exclusion. Section 3 presents the surficial quorum
system. Section 4 presents quorum-based algorithms for group mutual exclusion.
Conclusions and future work are offered in Section 5.

2 The Group Mutual Exclusion Problem 

We consider a system of n asynchronous processes 1, . . . , n, each of which cycles 
through the following three states, with NCS being the initial state: 

— NCS: the process is outside CS (the Critical Section), and does not wish to
enter CS. 

— trying: the process wishes to enter CS, but has not yet entered CS. 

— CS: the process is in CS. 

The processes belong to m groups 1, . . . , m. To make the problem more general, 
we do not require groups to be disjoint. When a process may belong to more 
than one group, the process must identify a unique group to which it belongs 
when it wishes to enter CS. Since group membership matters only at the
time a process wishes to enter CS (and while the process is in CS), when
we say ‘process i belongs to group j’, we implicitly assume that process i has
specified j as its group for entering CS.¹

The problem is to design an algorithm for the system satisfying the following 
requirements: 

mutual exclusion: At any given time, no two processes of different groups are
in CS simultaneously. 

lockout freedom: A process wishing to enter CS will eventually succeed. 

Moreover, to avoid degenerate solutions and unnecessary synchronization, we
are looking for algorithms that can facilitate “concurrent entering”, meaning
that if a group g of processes wish to enter CS and no other group of processes
is interested in entering CS, then the processes in group g can concurrently
enter CS nm.

3 A Quorum System for Group Mutual Exclusion 

In this section we present a quorum system for group mutual exclusion. Let 
$P = \{1, \ldots, n\}$ be a set of nodes,² which belong to $m$ groups. An $m$-group

¹ The problem is described in a more anthropomorphous setting as Congenial Talking
Philosophers in m-

² The terms processes and nodes will be used interchangeably throughout the paper.
For a distinguishing purpose, however, we use “nodes” specifically to denote quorum
members, and “processes” to denote group members.

quorum system $\mathcal{S} = (C_1, \ldots, C_m)$ over $P$ consists of $m$ sets, where each
$C_i \subseteq 2^P$ is a set of subsets of $P$ satisfying the following conditions:

intersection: $\forall\, 1 \le i, j \le m,\ i \ne j,\ \forall Q_1 \in C_i,\ \forall Q_2 \in C_j : Q_1 \cap Q_2 \ne \emptyset$.
minimality: $\forall\, 1 \le i \le m,\ \forall Q_1, Q_2 \in C_i,\ Q_1 \ne Q_2 : Q_1 \not\subseteq Q_2$.

We call each $C_i$ a cartel, and each $Q \in C_i$ a quorum.

Intuitively, $\mathcal{S}$ can be used to solve group mutual exclusion as follows: each
process $i$ of group $j$, when attempting to enter CS, must acquire permission
from every member in a quorum $Q \in C_j$. Upon exiting CS, process $i$ returns
the permission to the members of the quorum. Suppose a quorum member gives
permission to only one process at a time. Then, by the intersection property, no
two processes of different groups can be in CS simultaneously. The minimality
property is used rather to enhance efficiency. As is easy to see, if $Q_1 \subseteq Q_2$, then
a process that can obtain permission from every member of $Q_2$ can also obtain
permission from every member of $Q_1$.

Recall that a quorum system over $P$ for mutual exclusion is a set $C \subseteq 2^P$ of
quora satisfying the following requirements:

intersection: $\forall Q_1, Q_2 \in C : Q_1 \cap Q_2 \ne \emptyset$.
minimality: $\forall Q_1, Q_2 \in C,\ Q_1 \ne Q_2 : Q_1 \not\subseteq Q_2$.

To distinguish quorum systems for mutual exclusion from group quorum systems, 
we refer to the former as ordinary quorum systems. 

An ordinary quorum system $C$ can be used as an $m$-group quorum system
by a straightforward transformation $\mathcal{T}_m$:

$$\mathcal{T}_m(C) = (\underbrace{C, \ldots, C}_{m}).$$

By the intersection property of $C$, all quora in a cartel of $\mathcal{T}_m(C)$ are pairwise
intersecting. Note that, in general, quora in the same cartel of a group quorum
system need not intersect.
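
The two defining conditions and the transformation from an ordinary quorum system can be sketched in code. This is an illustrative check only: the representation (each cartel is a set of `frozenset` quora) and the names `is_group_quorum_system` and `replicate` are assumptions made here, not notation from the paper.

```python
from itertools import combinations


def is_group_quorum_system(cartels):
    """Check the intersection and minimality conditions.

    cartels: a sequence of cartels, each a collection of frozensets.
    """
    # Intersection: quora taken from two different cartels must meet.
    for ci, cj in combinations(cartels, 2):
        if any(not (q1 & q2) for q1 in ci for q2 in cj):
            return False
    # Minimality: within a cartel, no quorum contains another one.
    for cartel in cartels:
        if any(q1 != q2 and q1 <= q2 for q1 in cartel for q2 in cartel):
            return False
    return True


def replicate(ordinary, m):
    """Use the same ordinary quorum system as every one of the m cartels."""
    return tuple(ordinary for _ in range(m))
```

For instance, the majority system over three nodes, `{frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})}`, passes the check when replicated into three cartels, whereas a 2-group system built from rows and columns of a 2 x 2 grid passes even though the quora inside each cartel are disjoint.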

We define the degree of a cartel $C$, denoted as $\deg(C)$, to be the maximum
number of pairwise disjoint quora in $C$. The degree of a group quorum system $\mathcal{S}$,
denoted as $\deg(\mathcal{S})$, is the minimum $\deg(C)$ among the cartels $C$ in $\mathcal{S}$. Clearly,
if a node gives out its permission to at most one process at a time, then the
number of processes (of the same group) that can be in CS simultaneously is
limited to $\deg(C)$, where $C$ is the cartel associated with the group. Moreover, a
group quorum system of degree $k$ immediately implies that every cartel contains
at least one unhit quorum even if $k - 1$ processes have failed. So high degree group
quorum systems also provide a better protection against faults.
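
The degree definitions can be computed directly for small systems. The sketch below uses exhaustive search over subsets of quora, which is exponential and only meant to make the definition concrete; the function names and the `frozenset` representation are assumptions made for illustration.

```python
from itertools import combinations


def cartel_degree(cartel):
    """Maximum number of pairwise disjoint quora in a cartel (brute force)."""
    quora = list(cartel)
    for size in range(len(quora), 0, -1):
        for group in combinations(quora, size):
            if all(a.isdisjoint(b) for a, b in combinations(group, 2)):
                return size
    return 0


def system_degree(cartels):
    """Degree of a group quorum system: the minimum cartel degree."""
    return min(cartel_degree(c) for c in cartels)
```

With cartels given by the rows and the columns of a 2 x 2 grid the degree is 2, while a cartel whose quora pairwise intersect (such as a majority system) has degree 1.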

In the following we present an $m$-group quorum system $\mathcal{S}_m = (C_1, \ldots, C_m)$
with degree $\sqrt{2n/(m(m-1))}$. In addition, the quora in the system satisfy the following
four extra conditions:

1. $\forall\, 1 \le i, j \le m : |C_i| = |C_j|$.

2. $\forall\, 1 \le i, j \le m,\ \forall Q_1 \in C_i,\ \forall Q_2 \in C_j : |Q_1| = |Q_2|$.




20 



Y.-J. Joung 



3. $\forall\, p, q \in P : |n_p| = |n_q|$, where $n_p$ is the multiset $\{Q \mid \exists\, 1 \le i \le m : Q \in C_i$
and $p \in Q\}$, and similarly for $n_q$. In other words, $|n_p|$ is the number of quora
involving $p$.

4. $\forall\, 1 \le i, j \le m,\ i \ne j,\ \forall Q_1 \in C_i,\ \forall Q_2 \in C_j : |Q_1 \cap Q_2| = 1$.

Intuitively, the first condition ensures that each group has an equal chance in
competing for CS. The second condition ensures that the number of messages needed
per entry to CS is independent of the quorum a process chooses. The third 
condition means that each node shares the same responsibility in the system. As 
argued by Maekawa m, these three conditions are desirable for an algorithm to 
be truly distributed. The last condition simply minimizes the number of nodes 
that must be common to any two quora of different cartels, thereby reducing the 
size of a quorum. 

Before presenting the detailed construction of $\mathcal{S}_m$, we first provide some
intuitions. It is easy to see that a 1-group quorum system $\mathcal{S}_1$ satisfying the
above conditions can be obtained as follows:

$$\mathcal{S}_1 = (\{\{p\} \mid p \in P\})$$

The quorum system can be viewed as a line consisting of $n$ nodes, each of which
corresponds to a process in $P$, where $n = |P|$. Each quorum then consists of
exactly one node on the line, and the collection of the quora constitutes the
only cartel in the system. See Figure 1, top. By extending this line to a
two-dimensional plane, we can obtain a 2-group quorum system $\mathcal{S}_2 = (C_1, C_2)$, where
each quorum in $C_1$ corresponds to the set of nodes in each row, while each
quorum in $C_2$ corresponds to the set of nodes in each column. By taking one step
further, we can construct a 3-group quorum system $\mathcal{S}_3 = (C_1, C_2, C_3)$ by
arranging nodes on the surface of a cube. Notice that a cube can be “wrapped up” by
lines (strings) along three different dimensions. Lines along the same dimension
are parallel to each other, while any two lines along different dimensions must
intersect at two points. If we arrange the nodes on only three sides of the cube
as shown in Figure 1, then every two lines along different dimensions intersect
at exactly one point.

We can unfold the three sides of the cube on the plane as shown on the left
of Figure 2. Each quorum in $C_1$ then corresponds to a vertical line across the
first (the rightmost) column of squares. Each quorum in $C_2$ corresponds to a
horizontal line across the top square, and a vertical line across the left square on
the bottom. Finally, each quorum in $C_3$ corresponds to a horizontal line across
the two squares on the bottom.

By appending another three squares to the bottom of the above pile of squares
and extending the lines to these extra squares, we can construct $\mathcal{S}_4$ as shown on
the right of Figure 2. Each quorum in $C_1$ corresponds to a vertical line across
the first column of squares. Each quorum in $C_2$ corresponds to a horizontal line
across the square on the first level (starting from the top), and then a vertical
line across the second column of squares. Each quorum in $C_3$ corresponds to a
horizontal line across the squares on the second level, and then a vertical line
across the third column of squares. Finally, each quorum in $C_4$ corresponds to

Fig. 1. Surficial quorum systems.

Fig. 2. The unfolded surficial quorum system for $\mathcal{S}_3$ (left), and $\mathcal{S}_4$ (right).



a horizontal line across the squares on the third level. Notice that on the right
staircase of Figure 2, the first level of squares constitutes $\mathcal{S}_2$, and the first two
levels of squares constitute $\mathcal{S}_3$.

This procedure can be extended to $\mathcal{S}_m$. In general, each $C_i$ in $\mathcal{S}_m$ needs
$m - 1$ squares, each of which is to be shared with one of the other $m - 1$ cartels
so that the corresponding lines of the two cartels intersect at exactly one node
on the square. Overall, there are $m(m - 1)/2$ squares. Let $k$ be the width of each
square (i.e., the number of quora in each cartel). Then each square consists of
$k^2$ nodes. So the total number of nodes on the $m(m - 1)/2$ squares is $k^2 m(m - 1)/2$.
A simple way to map nodes on the squares to the processes in $P$ is to let each
node correspond to a unique process. In this case $k^2 m(m - 1)/2 = n$, where
$n = |P|$. So

$$k = \sqrt{\frac{2n}{m(m-1)}}, \qquad m > 1.$$

So the quorum size is

$$(m - 1)\sqrt{\frac{2n}{m(m-1)}} = \sqrt{\frac{2n(m-1)}{m}}$$

and each cartel consists of $\sqrt{2n/(m(m-1))}$ quora.
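
As a quick numerical check of these formulas, consider the illustrative values $n = 54$ and $m = 4$ (chosen here for convenience; they do not appear in the paper):

```latex
\begin{align*}
\text{number of squares} &= \frac{m(m-1)}{2} = \frac{4 \cdot 3}{2} = 6, \\
k &= \sqrt{\frac{2n}{m(m-1)}} = \sqrt{\frac{108}{12}} = 3, \\
\text{nodes} &= k^2 \cdot \frac{m(m-1)}{2} = 9 \cdot 6 = 54 = n, \\
\text{quorum size} &= (m-1)\,k = 3 \cdot 3 = 9 = \sqrt{\frac{2n(m-1)}{m}} = \sqrt{81}.
\end{align*}
```

Each of the four cartels then consists of $k = 3$ quora of 9 nodes each.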

In the following we present the “staircase” construction of the surficial quo- 
rum system. 

$p_{k,1}$  ...  $p_{2,1}$  $p_{1,1}$
$p_{k,2}$  ...  $p_{2,2}$  $p_{1,2}$
   ...
$p_{k,k}$  ...  $p_{2,k}$  $p_{1,k}$

Fig. 3. Arrangement of nodes in $\mathcal{S}_m$. On the left are the indices (superscripts) of squares,
and on the right are the indices (subscripts) of nodes in each square.



Inputs. $P \subseteq \mathbb{N}$, $n, m \in \mathbb{N}$, where $n = |P|$.

Outputs. An $m$-group quorum system $\mathcal{S}_m = (C_1, \ldots, C_m)$ over $P$.

Assumption. $m > 1$ and $\sqrt{2n/(m(m-1))} = k$ for some integer $k$.

1. Partition $P$ into $m(m-1)/2$ subsets, each of which consists of $k^2$ nodes. Let
$P^{i,j}$ denote these subsets, $1 \le i \le m - 1$, $i \le j \le m - 1$.
For each $P^{i,j}$, let $p^{i,j}_{r,s}$ denote the nodes in the set, where $1 \le r, s \le k$. (See
Figure 3.)

2. For each cartel $C_i$ in $\mathcal{S}_m$, denote the quora in the cartel by $Q_{i,j}$, $1 \le i \le m$,
$1 \le j \le k$. Then, $Q_{i,j}$ is defined by

$$Q_{i,j} = \{p^{s,i-1}_{r,j} \mid 1 \le s \le i - 1,\ 1 \le r \le k\} \cup \{p^{i,s}_{j,r} \mid i \le s \le m - 1,\ 1 \le r \le k\}$$

Note that when $i = 1$, the first set in the formula is empty, and when $i = m$,
the second part is empty.
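
The staircase construction can be sketched as follows. This is a minimal illustration under stated assumptions rather than the author's implementation: nodes are labelled by tuples `(i, j, r, s)` standing for node (r, s) of square (i, j), instead of being mapped to concrete process identifiers, and the indexing follows one reading of the construction in which a quorum of cartel i runs along one direction through squares (s, i-1) for s < i and along the other direction through squares (i, s) for s >= i. The function name is hypothetical.

```python
from itertools import combinations


def surficial_quorum_system(m, k):
    """Build m cartels of k quora each over k*k*m*(m-1)/2 labelled nodes.

    A node is the tuple (i, j, r, s): square (i, j) with 1 <= i <= j <= m-1,
    and position (r, s) inside it with 1 <= r, s <= k.
    """
    cartels = []
    for i in range(1, m + 1):
        cartel = []
        for j in range(1, k + 1):
            # Squares (s, i-1), s = 1..i-1: fix the second subscript to j.
            first = {(s, i - 1, r, j)
                     for s in range(1, i) for r in range(1, k + 1)}
            # Squares (i, s), s = i..m-1: fix the first subscript to j.
            second = {(i, s, j, r)
                      for s in range(i, m) for r in range(1, k + 1)}
            cartel.append(frozenset(first | second))
        cartels.append(cartel)
    return cartels
```

Under this reading, any two quora from different cartels share exactly one square and cross there in exactly one node, each quorum has (m - 1)k nodes, and the quora inside a cartel are pairwise disjoint, which matches the four extra conditions and the degree k stated in the text.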


Remarks 

Some comments on the surficial quorum system are in order. First, the
degree of $\mathcal{S}_m$ is $\sqrt{2n/(m(m-1))}$. In terms of degree, the construction is not optimal,
as we can construct an $m$-group quorum system with degree $\sqrt{n}$ Hg. It can
be seen that $\sqrt{n}$ is the upper bound as no $m$-group quorum system over a set
of $n$ nodes can have degree greater than $\sqrt{n}$ for any $m > 1$. (This is because
every quorum in an $m$-group quorum system of degree $k$ must contain at least $k$
elements. So every cartel must involve at least $k \cdot k \le n$ different nodes.) However,
to reach such degree, $n$ must be equal to some $p^{2c}$, where $p$ is a prime and $c$ is
a positive integer. Besides, the construction is difficult to visualize as it works
on an affine plane of order $p^c$. In contrast, the surficial quorum system is easy

$\mathcal{S} = (C_1, C_2, C_3, C_4)$, where

$C_1 = \{\{1,2,3\}, \{4,5,6\}, \{7,8,9\}\}$
$C_2 = \{\{1,6,8\}, \{2,4,9\}, \{3,5,7\}\}$
$C_3 = \{\{1,5,9\}, \{2,6,7\}, \{3,4,8\}\}$
$C_4 = \{\{1,4,7\}, \{2,5,8\}, \{3,6,9\}\}$

Fig. 4. A mapping between processes and nodes and the resulting 4-group quorum
system.

to visualize and so is easy to construct. Moreover, the surficial quorum system
minimizes processes’ loads by letting every node be included in at most two
quora. Under this condition, the maximum degree an $m$-group quorum system
can achieve is $\sqrt{2n/(m(m-1))}$, which is exactly what $\mathcal{S}_m$ has achieved.³

Second, in the construction we have chosen the mapping between nodes and
processes to be one-to-one. The advantage of this mapping is that it is simple
and geometrically evident. The quorum size may be reduced (and therefore
$\deg(\mathcal{S}_m)$ increased) by choosing a more sophisticated mapping to allow nodes to
map to the same process. For example, the mapping shown in Figure 4 for $m = 4$
and $n = 9$ results in a 4-group quorum system that is optimal in degree (i.e.,
with maximum possible degree).⁴ However, finding a mapping that optimizes
the degree for arbitrary $n$ and $m$ is considerably more difficult and remains a
challenging future work.

Third, any nonempty mapping $\psi$ from the nodes to the processes results in an
$m$-group quorum system, although it may not satisfy the four extra conditions
discussed at the beginning of this section. For example, the mapping $\psi(p^{i,j}_{r,s}) = q$
for any fixed $q \in P$, $1 \le i \le m - 1$, $i \le j \le m - 1$, $1 \le r, s \le k$ results
in a singleton-based $m$-group quorum system. Moreover, assume instead that
$k = \sqrt{n}$. Let the processes be arranged as a grid so that $P = \{q_{r,s} \mid 1 \le r, s \le k\}$.

³ To see this, let $\mathcal{C} = (C_1, \ldots, C_m)$ be a group quorum system of degree $k$ over an
$n$-set. Let $P^{i,j}_{r,s}$ be the intersection of the $r$th quorum in $C_i$ and the $s$th quorum in
$C_j$, $i \ne j$. By definition, $P^{i,j}_{r,s} \ne \emptyset$. Since every node is included in at most two quora,
$P^{i,j}_{r,s} \cap P^{i',j'}_{r',s'} = \emptyset$ if $\{i, j\} \ne \{i', j'\}$ (i.e., the two unordered pairs are not equal) or
$(r, s) \ne (r', s')$ (i.e., the two ordered pairs are not equal). Given that $1 \le i \ne j \le m$
and that each cartel contains at least $k$ quora, there are $\binom{m}{2}$ unordered pairs of $i, j$,
and for each of them, at least $k^2$ ordered pairs of $(r, s)$ that can constitute a $P^{i,j}_{r,s}$.



squares. Then, any vertical line of three nodes constitutes a quorum, and the three
vertical lines on the square constitute a cartel. Similarly, the three horizontal lines
constitute a second cartel. The other two cartels correspond to the three “rounded”
lines with slope 1 and −1, respectively. For example, the three “rounded” lines with
slope 1 on the top square are: {3, 6, 9}, {5, 2, 8}, and {7, 4, 1}.

Then the mapping $\psi(p^{i,j}_{r,s}) = q_{r,s}$ results in a quorum system $\mathcal{C} = (C_1, \ldots, C_m)$
such that each quorum in $C_1$ corresponds to a column in the grid, each quorum
in $C_m$ corresponds to a row in the grid, and each quorum in every other cartel $C_i$
corresponds to a row and a column.⁵ By experimenting with different mappings and
by adjusting the dimension of the squares (i.e., by replacing the squares with
rectangles), we can fine-tune the quorum system to fit different applications
in which the demand for the shared resource by different groups of processes is
unbalanced.

Finally, it is easy to see that $\mathcal{S}_m$ is in general “dominated.” Dominance
between group quorum systems is defined as follows m

Definition 1. Let $\mathcal{C} = (C_1, \ldots, C_m)$ and $\mathcal{D} = (D_1, \ldots, D_m)$ be two $m$-group
quorum systems over $P$. Then $\mathcal{C}$ dominates $\mathcal{D}$ if

1. $\mathcal{C} \ne \mathcal{D}$.

2. $\forall Q \in D_i,\ 1 \le i \le m,\ \exists R \in C_i : R \subseteq Q$.

$\mathcal{D}$ is nondominated if there is no $\mathcal{C}$ such that $\mathcal{C}$ dominates $\mathcal{D}$.

To illustrate dominance, the group quorum system $\mathcal{C} =
(\{\{1, 2\}, \{3, 4\}\}, \{\{1, 3\}, \{2, 4\}\})$ over $P = \{1, 2, 3, 4\}$ is dominated by
$\mathcal{D} = (\{\{1, 2\}, \{3, 4\}, \{1, 4\}, \{2, 3\}\}, \{\{1, 3\}, \{2, 4\}\})$. By a simple enumeration,
it can be seen that $\mathcal{D}$ is nondominated.
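
The dominance relation in this example can be checked with a short sketch. It assumes the usual reading of the definition, namely that the dominating system must differ from the dominated one and that every quorum of the dominated system contains some quorum of the corresponding cartel of the dominating system. The function name and the `frozenset` representation are illustrative choices.

```python
def dominates(c, d):
    """Return True if group quorum system c dominates d.

    c and d are tuples of cartels; each cartel is a frozenset of frozensets.
    """
    if len(c) != len(d):
        return False
    if all(ci == di for ci, di in zip(c, d)):
        return False  # assumed first condition: the two systems must differ
    # Every quorum Q of d_i must contain some quorum R of c_i.
    return all(any(r <= q for r in ci)
               for ci, di in zip(c, d) for q in di)


def cartel(*quora):
    """Helper: build a cartel from iterables of node identifiers."""
    return frozenset(frozenset(q) for q in quora)


small = (cartel({1, 2}, {3, 4}), cartel({1, 3}, {2, 4}))
large = (cartel({1, 2}, {3, 4}, {1, 4}, {2, 3}), cartel({1, 3}, {2, 4}))
```

Here `dominates(large, small)` holds because every quorum of `small` also appears in `large`, while `dominates(small, large)` fails: the quorum {1, 4} contains neither {1, 2} nor {3, 4}.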

Intuitively, a group quorum system C dominates D if whenever a quorum of 
a cartel in D can survive some failures, then some quorum in the corresponding 
cartel of C can certainly survive as well. Thus, C is said to be superior to D 
because C provides more protection against failures. 

“Fully distributedness” and “nondominance” appear to be two conflicting
notions, as for example, the “fully distributed” ordinary quorum system proposed
by Maekawa [2E| is also dominated PHESI. However, the construction of
$\mathcal{S}_m = (C_1, \ldots, C_m)$ is such that every quorum $Q$ in the cartels is a minimal set
that intersects every other quorum in a different cartel. As proved in m, this
property implies that $\mathcal{S}_m$ can be extended to a nondominated group quorum
system $\mathcal{U} = (D_1, \ldots, D_m)$ such that $C_i \subseteq D_i$ for all $1 \le i \le m$. In other words,
we can build upon $\mathcal{S}_m$ a nondominated group quorum system $\mathcal{U}$ such that $\mathcal{U}$
“contains” $\mathcal{S}_m$. An important meaning of this “containing” relation is that $\mathcal{S}_m$
can be used to realize a truly distributed algorithm when failures do not occur,
while $\mathcal{U}$ can be used to “back up” $\mathcal{S}_m$ when failures do occur to increase fault
tolerance.



⁵ Note that because in the construction a horizontal line of nodes is “connected” by
a unique vertical line of nodes, not any union of a row and a column in the grid
represents a quorum in $C_i$. However, by relaxing the “connection” condition, we can
obtain a $C_i$ that is composed of any union of a row and a column in the grid. Note
further that $C_1$ and $C_m$ have degree $\sqrt{n}$, while the other cartels have degree 1.

4 Quorum-Based Algorithms for Group Mutual Exclusion

In this section we present two algorithms that use quorum systems to solve group 
mutual exclusion. The network is assumed to be complete and the communica- 
tion channel between each pair of processes is reliable and FIFO. We begin with 
Maekawa’s algorithm [20]. Then we discuss some variations involving trade-offs
between concurrency and message complexity, and between concurrency and 
synchronization delay. 



4.1 The Basic Framework 

Most quorum-based algorithms for mutual exclusion are based on Maekawa’s
algorithm [20], which works as follows:

1. A process i wishing to enter CS sends a request message to each member of 
a quorum Q, and enters CS only after it has received a permission message 
from every member of Q. 

Upon leaving CS, a process sends a release message to every member of Q 
to release its permission. 

2. A node gives away one permission at a time. So upon receipt of a request by 
i, a node j checks if it has given away its permission. 

a) If j has given a permission message to another process k, and has not yet 
received a release message from k, then some arbitration mechanism is 
used to determine whether to let i wait, or to let i preempt k’s possession 
of the permission. 

b) Otherwise, j sends a permission message to i. 
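The two rules above can be sketched as follows. This is a minimal illustration, not the paper's code: the class `Arbiter`, the function `can_enter`, and the message tuples are hypothetical names, and the arbitration step is reduced to the simplest possible choice (the new requester always waits, with no preemption).

```python
from collections import deque


class Arbiter:
    """One quorum member j: it hands out a single permission at a time."""

    def __init__(self):
        self.holder = None        # process currently holding j's permission
        self.waiting = deque()    # requests deferred by the arbitration step

    def on_request(self, i):
        """Rule 2: grant if the permission is free, otherwise arbitrate."""
        if self.holder is None:
            self.holder = i
            return [("permission", i)]
        # Simplest arbitration (an assumption of this sketch): let i wait.
        self.waiting.append(i)
        return []

    def on_release(self, k):
        """The holder k returns the permission; pass it to the next waiter."""
        assert self.holder == k
        self.holder = self.waiting.popleft() if self.waiting else None
        return [("permission", self.holder)] if self.holder is not None else []


def can_enter(quorum, grants):
    """Rule 1: a process enters CS once every member of its quorum has granted."""
    return set(quorum) <= set(grants)


nodes = {n: Arbiter() for n in (1, 2, 3)}
grants_a = [n for n in (1, 2) if nodes[n].on_request("a")]
grants_b = [n for n in (2, 3) if nodes[n].on_request("b")]
print(can_enter((1, 2), grants_a), can_enter((2, 3), grants_b))
```

Here process a obtains both permissions of its quorum {1, 2}, while b must wait for node 2, which is shared by the two quorums.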

In general, request messages by process i to the members of Q are sent 
simultaneously so as to minimize the synchronization delay, which is the delay 
from the time a process invokes a mutual exclusion request until the time it 
enters CS. The delay is measured in terms of message transmission time. It is 
clear that the minimum synchronization delay is 2 if request messages are sent 
simultaneously. However, due to the asynchrony of the system, a process may
hold a permission while waiting for another process to release a permission. This
in turn may cause a deadlock.

Maekawa’s algorithm handles deadlocks by requiring low-priority processes to
yield permissions to high-priority processes. Priorities are usually implemented
by Lamport’s logical timestamps [?]: the smaller the timestamp of a request,
the higher the priority of the request. Specifically, if a node i receives a request
by j after giving a permission to k and j’s priority is higher than k’s, then i issues
an inquiry message to k. Process k then returns the permission to i (by sending a
release message) if it determines that it cannot successfully acquire permissions
from all members of the quorum it has chosen. After receiving the release message,
node i gives its permission to j by sending j a grant message. When j exits CS
and releases i’s permission, i returns the permission to k (provided that no
other process with a priority higher than k’s is also waiting for i’s permission).
So Maekawa’s algorithm needs 3c to 6c messages per entry to CS, where c is the
(maximum) size of a quorum.

Maekawa’s algorithm works straightforwardly for group mutual exclusion.
Let $\mathcal{C} = (C_1, \ldots, C_m)$ be an m-group quorum system over P. A process $i \in P$
that wishes to enter CS as a member of group j chooses a quorum from the cartel
$C_j$, and enters CS only when it has obtained a permission from every member of
the quorum. By the mutual exclusion property of $\mathcal{C}$ and by the conflict resolution
scheme used in the algorithm, mutual exclusion and lockout freedom are easily
guaranteed.

4.2 A Tradeoff between Concurrency and Message Complexity 

In Maekawa’s algorithm, since a node gives permission to only one process at 
a time, the maximum number of processes of a group that can be in CS si- 
multaneously is limited to the degree of the cartel associated with the group. 
So no concurrency is offered using group quorum systems Tm(C) derived from 
ordinary quorum systems C (which have degree one). 

Group quorum systems with degree greater than one may still not be able
to offer a satisfactory degree of concurrency. This is because the size of a
group can be greater than $\sqrt{|P|}$, whereas, as discussed earlier, no m-group
quorum system over P can have degree more than $\sqrt{|P|}$, unless m = 1.

To overcome this, we modify Maekawa’s algorithm so that a node can give 
permission to more than one process. So even if quora in the same cartel may 
intersect, two or more processes may still enter CS simultaneously. The rule for 
a node k to determine whether to grant i’s request for permission is as follows: 

k grants i’s request so long as there is no conflict, i.e., no other process
of a different group is also requesting/possessing k’s permission.
Otherwise, conflicts are resolved as follows: a process i of group g yields k’s
permission to another process j of group h if j has priority higher than
all processes of group g that currently request/possess k’s permission.

Note that because of the rule, requests are not processed strictly in First-Come- 
First-Served (FCFS) order. In particular, a node k may grant i’s request even if 
some other j of a different group is already waiting for k’s permission (regardless
of whether j’s priority is higher than i’s or not). The reason for this is to increase
system performance. We shall explain this in more detail in the following section.
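Since the detailed code of Maekawa_M is omitted, the following Python sketch illustrates only the grant decision of a node k under stated assumptions. It follows the note above in that waiting requests of other groups do not block a request from the group that currently holds permissions. Priorities are assumed to be Lamport timestamps with ties broken by process id, and only current holders (not pending requesters of the same group) are compared. The names `Req` and `decide` are hypothetical.

```python
from typing import NamedTuple


class Req(NamedTuple):
    timestamp: int   # Lamport timestamp; smaller means higher priority
    process: str
    group: str


def decide(new, holders):
    """Return 'grant', 'inquire' (ask holders to yield) or 'wait' for request `new`."""
    if all(h.group == new.group for h in holders):
        return "grant"                      # no conflict with the current holders
    # Conflict: the holders yield only if `new` outranks every one of them.
    if all((new.timestamp, new.process) < (h.timestamp, h.process) for h in holders):
        return "inquire"
    return "wait"


holders = [Req(5, "a", "g"), Req(9, "b", "g")]
print(decide(Req(12, "d", "g"), holders))   # same group as the holders
print(decide(Req(3, "c", "h"), holders))    # different group, higher priority
print(decide(Req(7, "c", "h"), holders))    # different group, not the highest
```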

We refer to the algorithm as Maekawa_M. Because the algorithm is fairly easy 
to design, we omit the detailed code of the algorithm in this extended abstract. 

4.3 A Tradeoff between Concurrency and Synchronization Delay 

In the above algorithm, after a node i has given permission to k processes, i
may have to withdraw its permission if it receives a higher priority request from
a different group. Withdrawing a permission from a process results in three
messages: an inquiry message, a relinquish message, and eventually the return
of the permission to the process. So in the worst case, a request to i may generate
$3k$ messages. So the maximum number of messages per entry to CS is $3c + 3c \cdot r$,
where c is the quorum size, and r is the maximum number of permissions a node
may give away at a time. To facilitate maximum concurrency while minimizing
message complexity, for each cartel $C$, r should be limited to $s/\deg(C)$, where
s is the size of the group that uses $C$. In this case, an entry to CS requires at
most $3c + 3c \cdot s/\deg(C)$ messages. Given that s is determined by the problem definition,
the message complexity is roughly proportional to $c/\deg(C)$. So the higher the
degree of the underlying group quorum system, the lower the message complexity.
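As a small worked example of this bound, the sketch below evaluates $3c + 3c \cdot r$ for illustrative values. The function names and the numbers are assumptions, and rounding $s/\deg(C)$ up to an integer is a choice made here, not something stated in the text.

```python
import math


def permissions_cap(s, deg):
    """r = s / deg(C), rounded up so that r is a whole number of permissions."""
    return math.ceil(s / deg)


def max_messages(c, r):
    """Worst-case messages per entry to CS: 3c + 3c * r."""
    return 3 * c + 3 * c * r


c, s = 10, 12                      # illustrative quorum size and group size
for deg in (1, 4, 12):
    r = permissions_cap(s, deg)
    print(deg, r, max_messages(c, r))
```

With r = 1 the formula gives 6c, which agrees with the worst case quoted for the basic algorithm; a higher degree lowers r and hence the bound.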

If message complexity needs to be bounded by $O(c)$, deadlocks must be
resolved in a different way. A well-known technique in resource allocation is to let
each process request permissions from quorum members in some fixed order,
say, with increasing node IDs. So if a process i is waiting for j’s permission, then
every permission i holds must be from some f such that $f < j$. Moreover, every
process k that currently holds j’s permission has either obtained permissions
from all members of its quorum, or is waiting for a permission from some h such
that $h > j$. So deadlocks are not possible because there is no circular waiting.
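The ordering argument can be illustrated with ordinary locks. In this hedged sketch each quorum member is modelled by a `threading.Lock` (that is, a node giving out one permission at a time, which is a simplification), and every process acquires the permissions of its quorum in increasing node id. The names `locks`, `entered`, and `enter_and_leave` are illustrative.

```python
import threading

locks = {node_id: threading.Lock() for node_id in range(1, 6)}
entered = []


def enter_and_leave(name, quorum):
    ordered = sorted(quorum)              # fixed global order: increasing node id
    for node_id in ordered:
        locks[node_id].acquire()          # wait for this member's permission
    entered.append(name)                  # critical section
    for node_id in ordered:
        locks[node_id].release()


threads = [
    threading.Thread(target=enter_and_leave, args=("p", [3, 1, 4])),
    threading.Thread(target=enter_and_leave, args=("q", [4, 3, 2])),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
print(sorted(entered))
```

Because both processes request node 3 before node 4, neither can hold 4 while waiting for 3, so the circular wait that would arise from acquiring in the listed order cannot occur.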

Note that the above deadlock-freedom argument does not depend on how many
permissions a node may give out at a time. That is, a node can still give out
multiple permissions. The number of messages required per entry to CS is $3c$,
and the minimum synchronization delay is $2c$. The message complexity and the
minimum synchronization delay can be reduced further to $2c + 1$ and $c + 1$,
respectively, by letting quorum members circulate request messages. The
complete code of the algorithm is given in Figures 5 and 6. We refer to the algorithm
as Maekawa_S. For ease of understanding, the algorithm is presented as two
CSP-like repetitive commands consisting of guarded commands [?]. Figure 5
describes the behavior of a process that acts as a group member, and Figure 6
describes the behavior of a node that acts as a quorum member. If one wishes,
the two repetitive commands can be combined into a single one.
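The counts $2c + 1$ and $c + 1$ can be reproduced with a short sketch of how a request circulates through a quorum. It uses the three functions on quora that Maekawa_S assumes (smallest id, next larger id, largest id); `next_member` stands in for `next` to avoid shadowing the Python built-in, and `route` and `cost` are hypothetical helpers.

```python
def first(q):
    """Smallest id in the quorum."""
    return min(q)


def next_member(k, q):
    """Smallest id in the quorum that is greater than k."""
    return min(m for m in q if m > k)


def last(q):
    """Largest id in the quorum."""
    return max(q)


def route(q):
    """Order in which a circulating REQUEST visits the quorum members."""
    hops = [first(q)]
    while hops[-1] != last(q):
        hops.append(next_member(hops[-1], q))
    return hops


def cost(q):
    """(messages per entry, minimum synchronization delay) for one entry to CS."""
    c = len(route(q))
    requests, grant, releases = c, 1, c   # c REQUEST hops, 1 GRANT, c RELEASEs
    return requests + grant + releases, requests + grant


print(route({7, 2, 5}), cost({7, 2, 5}))
```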

Although we have assumed FIFO communication channels, in the algorithm
a node j may receive a process i’s request before it receives i’s release message
for i’s previous request (lines D.9-10). This occurs because request messages hop
through quorum members. So when i issues a release message $msg_1$ to j and then
issues a new request $msg_2$, $msg_2$ may arrive at j (indirectly through members
of a different quorum) before $msg_1$ does. To simplify the algorithm, j defers the
processing of $msg_2$ until it has received $msg_1$ (lines E.16-18).

Like Maekawa_M, requests are not processed strictly in FCFS order. Instead, 
when a node j receives a request from a process i of group g, if j has no out- 
standing permission, then j sends i a permission and chooses i as a reference 
(line D.7). A reference process is used such that subsequent requests from the 
same group are also granted so long as the reference remains in CS. When a 
reference process leaves CS, if no other process of a different group is waiting 
for j’s permission, then a new reference is chosen from those processes that
currently hold j’s permissions (lines E.3-6). Otherwise, the reference is reset to $\bot$,
A.1   *[ wish to enter CS as a member of group g →
A.2        state := trying;
A.3        randomly select a quorum Q from C_g;
A.4        send REQUEST(i, g, Q) to first(Q);
B.1    □ receive GRANT(j) →
B.2        state := CS;    /* acquires quorum Q */
C.1    □ exit CS →
C.2        state := NCS;
C.3        for j ∈ Q do send RELEASE(i) to j;
C.4   ]



Variables:

- state: the state of process i.
- Q: the quorum i selects. We assume the following three functions on quora:
  • first(Q): return the smallest id in Q.
  • next(k, Q): return the smallest id in Q that is greater than k.
  • last(Q): return the largest id in Q.
- C_g: the cartel associated with group g.

Messages:

- REQUEST(i, g, Q): a request by i to obtain the recipient’s permission to enter CS
  as a member of group g. Q is the (id of the) quorum i chooses.
- GRANT(j): a permission given by node j to the recipient. In particular, j is the
  last node in Q.
- RELEASE(i): a message by i to release the recipient’s permission.



Fig. 5. Algorithm Maekawa_S executed by process i. 



meaning that the “door” to CS for the group is closed to yield the opportunity 
to another group. 

There are two reasons for choosing this “entry policy”. First, by the mutual 
exclusion property, while some reference process i is in CS, no other group of 
processes can be in CS. So maximal resource utilization can be achieved by 
allowing more processes of the same group to share CS with i, regardless of 
whether some other group of processes are waiting for CS or not. Furthermore, 
because while i is in CS, some fast process may enter and exit CS any number of
times, the algorithm facilitates an unbounded degree of concurrency. Note
that lockout freedom can still be guaranteed because a reference process will
eventually leave CS and close the “door” to CS for its group.

The other reason is to minimize the number of “context switches” [?]. (A
context switch occurs when the next entry to CS is by a different group of
processes.) As analyzed in [?], in group mutual exclusion requests to CS cannot
be processed in a strictly FCFS order, or else the system could degenerate to
the case in which nearly only one process can be in CS at a time when m is
large. As can be seen, this phenomenon will also appear in both Maekawa_M
D.1   *[ receive REQUEST(i, g, Q) →
D.2        if grant_ps = ∅ ∨ (g = current_group ∧ reference ≠ ⊥ ∧ i ∉ grant_ps) then [
               /* grant the request */
D.3            if j = last(Q) then send GRANT(j) to i;
D.4            else send REQUEST(i, g, Q) to next(j, Q);
D.5            if grant_ps = ∅ then [
D.6                current_group := g;
D.7                reference := i; ]
D.8            grant_ps := grant_ps + {i}; ]
D.9        else if i ∈ grant_ps then    /* i’s request arrives before its previous release message */
D.10           early_requests := early_requests + {(i, g, Q)};
D.11       else request_ps := request_ps + {(i, g, Q)};
E.1    □ receive RELEASE(i) →
E.2        grant_ps := grant_ps − {i};
E.3        if reference = i then
E.4            if grant_ps ≠ ∅ ∧ request_ps = ∅ then    /* choose a new reference */
E.5                reference := k for some arbitrary k ∈ grant_ps;
E.6            else reference := ⊥;
E.7        if grant_ps = ∅ ∧ request_ps ≠ ∅ then [
               /* grant the earliest request in the queue, as well as the requests from
                  the same group of processes */
E.8            (k, h, R) := first(request_ps);
E.9            reference := k;
E.10           current_group := h;
E.11           for (k', h', R') ∈ request_ps, h = h', do [
E.12               if j = last(R') then send GRANT(j) to k';
E.13               else send REQUEST(k', h', R') to next(j, R');
E.14               request_ps := request_ps − {(k', h', R')};
E.15               grant_ps := grant_ps + {k'}; ] ]
E.16       if (i, g, S) ∈ early_requests for some g and S, then [
               /* process i’s early request */
E.17           send REQUEST(i, g, S) to j;
E.18           early_requests := early_requests − {(i, g, S)}; ]
E.19  ]

Variables:

- request_ps: queue of requests received by j. The requests are represented as tuples
  (i, g, Q), where i is the requester, g is the group of the requester, and Q is the
  quorum i chooses. The queue is initialized to ∅. Requests in the queue are ordered
  by the time they are inserted into the queue. Function first(request_ps) returns the
  earliest request in the queue.
- early_requests: queue of “early” requests received by j. A request (i, g, Q) is said to be
  “early” and is temporarily stored in early_requests if i has released j’s permission
  before it issued the request, but the request arrives at j before the release message.
- grant_ps: set of processes to which j has given a permission. It is initialized to ∅.
- current_group: the group of the processes to which j has given permissions. It is
  initialized to ⊥.
- reference: a reference process used to determine whether subsequent processes of
  current_group can enter CS.



Fig. 6. Algorithm Maekawa_S executed by node j. 
and Maekawa_S. So in both algorithms we allow some late requests to “jump
over” existing requests to allow more processes to enter CS concurrently. Due
to space limitations, a more detailed performance analysis will be presented in
the full paper.
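To make the “jump over” behaviour concrete, the following Python sketch mirrors the quorum-member logic of Fig. 6 under simplifying assumptions: messages are collected in an `outbox` list instead of being sent over channels, `None` plays the role of $\bot$, and an early request is reprocessed directly rather than re-sent to the node itself. The class and method names are illustrative, not the paper's code.

```python
class QuorumNode:
    """Quorum member j in the spirit of Fig. 6 (state names follow the figure)."""

    def __init__(self, j):
        self.j = j
        self.grant_ps = set()        # processes holding j's permission
        self.request_ps = []         # FIFO queue of deferred (i, g, Q)
        self.early_requests = []     # requests that overtook a RELEASE
        self.current_group = None
        self.reference = None        # None plays the role of "bottom"
        self.outbox = []             # (destination, message) pairs

    def _pass_on(self, i, g, q):
        # The last member grants; any other member forwards to the next larger id.
        if self.j == max(q):
            self.outbox.append((i, ("GRANT", self.j)))
        else:
            nxt = min(m for m in q if m > self.j)
            self.outbox.append((nxt, ("REQUEST", i, g, q)))

    def on_request(self, i, g, q):
        open_door = g == self.current_group and self.reference is not None
        if not self.grant_ps or (open_door and i not in self.grant_ps):
            self._pass_on(i, g, q)
            if not self.grant_ps:
                self.current_group, self.reference = g, i
            self.grant_ps.add(i)
        elif i in self.grant_ps:
            self.early_requests.append((i, g, q))
        else:
            self.request_ps.append((i, g, q))

    def on_release(self, i):
        self.grant_ps.discard(i)
        if self.reference == i:
            if self.grant_ps and not self.request_ps:
                self.reference = next(iter(self.grant_ps))
            else:
                self.reference = None          # close the "door" for this group
        if not self.grant_ps and self.request_ps:
            k, h, _ = self.request_ps[0]
            self.reference, self.current_group = k, h
            same = [r for r in self.request_ps if r[1] == h]
            self.request_ps = [r for r in self.request_ps if r[1] != h]
            for k2, h2, q2 in same:
                self._pass_on(k2, h2, q2)
                self.grant_ps.add(k2)
        for early in [e for e in self.early_requests if e[0] == i]:
            self.early_requests.remove(early)
            self.on_request(*early)


node = QuorumNode(2)
node.on_request("a", "g", (1, 2))   # a becomes the reference of group g
node.on_request("b", "g", (2,))     # same group: granted while the door is open
node.on_request("c", "h", (2, 3))   # other group: queued
node.on_release("a")                # reference leaves, c is waiting: door closes
node.on_request("d", "g", (2,))     # same group, but the door is closed: queued
node.on_release("b")                # last holder leaves: c is served next
print(node.outbox)
```

In this run b is granted ahead of nothing, d is not allowed to overtake c once the reference has left, and c's request is forwarded to node 3 as soon as group g has drained.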



4.4 Correctness of the Algorithms 



Maekawa’s algorithm has been well studied in [20132133101/] . It is easy to see that 
the algorithm can be directly adapted to group mutual exclusion using group 
quorum systems. For the two modified algorithms Maekawa_M and Maekawa_S
we have presented, their correctness can also be easily seen. So to save space, we
leave the formal analysis to the full paper.



5 Conclusions and Future Work 

We have presented a quorum system, the surficial quorum system, for group 
mutual exclusion. The surficial quorum system generalizes existing quorum sys- 
tems for mutual exclusion in that quora for processes of the same group need 
not intersect with one another. This generalization allows processes to acquire 
quora simultaneously, and so allows them to enter critical section concurrently. 
The surficial quorum system has a very simple geometrical structure, based on 
which a truly distributed algorithm for group mutual exclusion can be obtained, 
and based on which processes’ loads can be minimized. 

The surficial quorum system has degree -sj where n is the total num- 
ber of processes, and m is the total number of groups. So when used with 
Maekawa’s algorithm, it allows a maximum of processes to be in the 

critical section simultaneously. The message complexity per entry to the critical 
section is 6 • Furthermore, it can tolerate up to — 1 process 

failures. For comparison, the two message-passing algorithms RA1 and RA2
presented in [?] have message complexity 2n and 3n, respectively, and both allow
all group members to be in the critical section simultaneously. However, they
cannot tolerate any single process failure. In terms of minimum synchronization
delay, all three algorithms have the same measure: 2.

As we have noted earlier, the degree of a group quorum system is bounded
by $\sqrt{n}$. So Maekawa’s algorithm must be generalized if the group size is greater than
$\sqrt{n}$ and we wish to allow all group members to be in the critical section
simultaneously. Two generalizations, Maekawa_M and Maekawa_S, were presented in the
paper. Both allow all group members to be in the critical section simultaneously,
regardless of the degree of the underlying quorum system. Maekawa_M preserves
Maekawa’s minimum synchronization delay, but needs $3c + 3c \cdot s/d$ messages per
entry to the critical section (when a node gives away a bounded number of
permissions), where c is the quorum size, s is the group size, and d is the degree
of the underlying group quorum system. So when used with the surficial
quorum system, the message complexity is $O(n \cdot \min\{m, n\})$. The other algorithm,
Maekawa_S, trades off minimum synchronization delay for message complexity. It
reduces the message complexity to $2c + 1$, but needs a minimum synchronization
delay of $c + 1$.

There is a considerable literature on quorum systems. Many structures have
been explored to construct ordinary quorum systems, including finite
projective planes [?], weighted voting [?], grids [?], trees [?], wheels [?],
walls [?], and planar graphs [?]. The structure we used in the surficial quorum
system can be viewed as a “multidimensional” grid. For future work, it would be
interesting to investigate how the other structures can be used for group quorum
systems.
Abstract. We provide effective (i.e., recursive) characterizations of the relations
that can be computed on networks where all processors use the same algorithm, 
start from the same state, and know at least a bound on the network size. Three 
activation models are considered (synchronous, asynchronous, interleaved). 



1 Introduction 

The question concerning which problems can be solved by a distributed system when
all processors use the same algorithm and start from the same state has a long history:
it was first formulated by Angluin [Ang80], who investigated the problem of
establishing a “center”. She was the first to realize the connection with the theory of graph
coverings, which was going to provide, in particular with the work of Yamashita and
Kameda [YK96], several characterizations of problems that are solvable under
certain topological constraints. Further investigation led to the classification of computable
functions [YK96, YK98, ASW88, Nnr99] and made it possible to eliminate several restrictions
(such as bidirectionality, distinguished links, synchronicity, ...) [DKMP9?, BCG+96,
BV97a].

A few years ago, while lecturing at the Weizmann Institute about possibility and
impossibility results for function computation [BV97a], we were asked by Moni Naor
whether our results could be extended to the computation of arbitrary relations (as
opposed to the undecidability results of [NS95]). Of course, this is the most general case of
a distributed task: all classical problems such as election, topology reconstruction etc.
can be seen as the computation of a specific relation.

The present paper answers this question positively. We provide a proof technique
that makes it possible to show whether an anonymous algorithm that computes a given relation
exists under a wide range of models. Moreover, the proof technique is effective: if the
class of networks under examination is finite, and the relation to be computed is finite,
then the technique turns into a recursive procedure that provides either an anonymous
algorithm computing the relation, or a proof of impossibility.

Of course, results about specific relations (such as the one defining the election
problem or topology reconstruction) are already known, at least for certain models.
Here we complete the picture by providing a characterization of the relations that can be
computed on a wide range of models when (as usually assumed) a bound on the network
size is known. (If such a bound is not known, a completely different approach is needed,
as shown in [BV99].)
Our results are mainly of theoretical interest, because of the large amount of
information exchanged by the processors. Moreover, complexity issues are not taken into
consideration, as opposed, for instance, to [ASW88, ANIM96]. We rather concentrate
on general decidability properties that hold under any assumption of knowledge or of
communication primitives (broadcast, point-to-point, etc.).

Our networks are just directed graphs coloured on their arcs (information such as
processor identity, communication models etc. can be easily encoded in the arc
colouring [BCG+96]), and each processor changes its state depending on its previous state and
on the state of its in-neighbours. The problem specification is given by a certain relation
between the inputs and the outputs of the processors, and we allow the relation to depend
on the network, so that problems like topology reconstruction can be specified.

We consider three known activation models: asynchronous (at each step, we can
activate any subset of enabled processors), synchronous (all processors) and interleaved
(a.k.a. central daemon: exactly one processor), and we give characterization theorems
based on the notion of graph fibration, a weakening of the notion of covering used in
Angluin’s paper that also subsumes the concept of similarity of processors introduced
in [JS85]. Moreover, the characterizations are stated in a completely uniform way across
the activation models: the only change is in the family of fibrations considered.

We remark that the synchronous model is computationally equivalent to asyn- 
chronous FIFO networks with finite time delivery (which were the original reason behind 
the study of the model [YK96]).

The motivation for the study of impossibility results in anonymous networks stems
also from questions about self-stabilizing systems: as already noted in [SRR95], and
exploited in [BV97b], an impossibility result for anonymous networks gives an
impossibility result for a uniform self-stabilizing system (as one of the choices of the adversary
is setting all processors to the same state).

We begin by introducing a simple finite class of networks that is used to exemplify
our (fairly abstract) characterization theorems. Then, we give our main definitions, recall
the standard notion of view and introduce the main mathematical ingredient of our
proofs, graph fibrations. Finally, we prove our characterization theorem and give some
applications. For other examples of applications, the reader should be able to easily
apply our characterization theorems to classical problems such as election, topology
reconstruction or function computation, getting back the results of our previous papers.

2 A Guiding Example 

As a guiding example, we formulate a problem that, to our knowledge, has not been
previously addressed in the literature about anonymous networks. We are interested in
determining whether there exists an anonymous algorithm solving the majority problem
on a certain class of networks. All processors are given a boolean value (0 or 1) as input,
and eventually must choose an output, the same for all processors, that corresponds to
the majority of input values (in case of a tie, either value is correct, provided that all
processors agree on that value). The algorithm must work on every network of the class
under examination (informally speaking: processors just know that they live in one of the
networks depicted, but they know neither which one exactly, nor which is their position
in the network). We shall use the class depicted in Figure 1 as a case study. For simplicity,
in the example we restrict ourselves to the synchronous case.




Fig. 1. Can you compute majority here? 



3 Basic Definitions 

3.1 Graph-Theoretical Definitions 

A (directed) (multi)graph $G$ is given by a nonempty set $N_G = \{1, 2, \ldots, n_G\}$ of nodes
and a set $A_G$ of arcs, and by two functions $s_G, t_G : A_G \to N_G$ that specify the source
and the target of each arc. An (arc-)coloured graph (with set of colours $C$) is a graph
endowed with a colouring function $\gamma : A_G \to C$. We write $i \xrightarrow{a} j$ when the arc $a$ has
source $i$ and target $j$, and $i \to j$ when $i \xrightarrow{a} j$ for some $a \in A_G$. Subscripts will be
dropped whenever no confusion is possible.

A (in-directed) tree is a graph¹ with a selected node, the root, and such that any other
node has exactly one directed path to the root.

3.2 The Model 

In the following, by a network we shall always mean a strongly connected coloured
graph². The nodes of such a graph are called processors.

Computations of a network are defined by a state space and a transition function 
specifying how a processor must change its state when it is activated. The new state 
must depend, of course, on the previous state, and on the states of the in-neighbours; 
the latter are marked with the colour of the corresponding incoming arc (thus, e.g., if 
all incoming arcs have different colours a processor can distinguish its in-neighbours). 
For sake of simplicity, though, we discuss the case with just one colour, so that arcs and 
nodes are in fact not coloured; the introduction of more colours makes no significant 
conceptual difference. 

We also need a function mapping input values to initial states and final states to
output values. Formally, a protocol $P$ is given by a set $X$ of local states, a distinguished
subset $F \subseteq X$ of final states, an input set $T$, an input function $\mathrm{in} : T \to X$, an output
set $\Omega$, an output function³ $\mathrm{out} : F \to \Omega$ and a transition function $\delta : X \times X^{\oplus} \to X$
satisfying the constraint that for every $x \in F$, $\delta(x, -) = x$; here $X^{\oplus}$ is the set of finite
multisets over $X$.

¹ Since we need to manage infinite trees too, we assume that the node set of a tree can be $\mathbb{N}$.

² Colours on the nodes act as identifiers, whereas colours on the arcs define the communication
model. Choosing a suitable colouring we can encode properties such as the presence of unique
identifiers, distinguished outgoing/incoming links and so on [BCG+96].

Intuitively, the new state of a processor depends on its previous state (the first com- 
ponent of the cartesian product) and on the states of its in-neighbours (we must use 
multisets, as more than one in-neighbour might be in the same state). The additional 
condition simply states that when a processor enters a final state, it cannot change its 
own state thereafter, whatever input it receives from its in-neighbours. 

A global state (for $n$ processors, with respect to the protocol $P$) is a vector $\mathbf{x} \in X^n$.
Such a state is final if $x_i \in F$ for every $i$.

Given a network G and a global state x, we say that processor i is enabled in x iff 
an application of the transition function would change its state. Of course, no processor 
is enabled in a final state. 

An activation for $G$ in state $\mathbf{x}$ is a nonempty set of enabled processors; the activation
is called 

- synchronous if it contains every enabled processor; 

- interleaved if it contains but a single enabled processor. 

When we need to emphasize that no constraint is required on the set of activated pro- 
cessors, we use the term asynchronous. 
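The three kinds of activation can be enumerated directly from the set of enabled processors. The Python sketch below is only an illustration; the function name `activations` and the string labels for the models are assumptions.

```python
from itertools import combinations


def activations(enabled, model):
    """List the activations allowed by `model` for a given set of enabled processors."""
    enabled = sorted(enabled)
    if not enabled:
        return []                                   # an activation must be nonempty
    if model == "synchronous":
        return [set(enabled)]                       # every enabled processor
    if model == "interleaved":
        return [{p} for p in enabled]               # exactly one enabled processor
    if model == "asynchronous":
        return [set(c)                              # any nonempty subset
                for r in range(1, len(enabled) + 1)
                for c in combinations(enabled, r)]
    raise ValueError(model)


for model in ("synchronous", "interleaved", "asynchronous"):
    print(model, activations({1, 2, 3}, model))
```

Every synchronous or interleaved activation is also an asynchronous one, which is why the asynchronous model imposes no constraint.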

A computation of the protocol $P$ on the network $G$ (with $n$ processors) with input
$\mathbf{v} \in T^n$ is given by a (possibly infinite) sequence of global states $\mathbf{x}^0, \mathbf{x}^1, \ldots, \mathbf{x}^t(, \ldots)$
and a sequence of sets of processors $A^0, A^1, \ldots, A^t(, \ldots)$ (called the activation
sequence) such that:

1. $\mathbf{x}^0 = \mathrm{in}(\mathbf{v})$;
2. for all $t$, $A^t$ is an activation for $G$ in state $\mathbf{x}^t$, and $\mathbf{x}^{t+1}$ is obtained by applying $\delta$ to all
   processors of $A^t$ (the other processors do not change their state).

A computation is called synchronous, interleaved or asynchronous if every activation
is such. It is terminating if it is finite and its last state is final; in this case, its output is
defined as the vector $\mathrm{out}(\mathbf{x})$, where $\mathbf{x}$ is its last state.

Given a class of networks $\mathcal{C}$, a ($\mathcal{C}$-indexed) relation $R$ is a function associating to
each $G \in \mathcal{C}$ a set $R_G \subseteq T^n \times \Omega^n$, where $n$ is the number of processors of $G$. Relations
are used to define problems: for instance, if $T = \{\ast\}$ and $\Omega = \{0, 1\}$, the relation
containing all pairs $(\ast\ast\cdots\ast,\ b_1 b_2 \cdots b_n)$, where exactly one of the $b_i$’s is nonzero,
defines the well-known anonymous election problem. Function computations, topology
reconstruction etc. can be easily described in a similar way. Note that classes specify
knowledge: the larger the class, the smaller the knowledge (the class of all networks
corresponds to no knowledge at all; a singleton to maximum knowledge).
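As an illustration of relations as problem specifications, the following Python sketch tests membership of an input/output pair in the election relation described above and in the majority relation of the guiding example. Neither relation depends on the network here, so the graph is omitted; the function names are hypothetical.

```python
def in_election_relation(inputs, outputs):
    """All inputs are '*', outputs are bits, and exactly one output is nonzero."""
    return (len(inputs) == len(outputs)
            and all(v == "*" for v in inputs)
            and all(b in (0, 1) for b in outputs)
            and sum(outputs) == 1)


def in_majority_relation(inputs, outputs):
    """All processors output the same bit, and that bit is a majority input value."""
    if len(inputs) != len(outputs) or len(set(outputs)) != 1:
        return False
    ones = sum(inputs)
    zeros = len(inputs) - ones
    bit = outputs[0]
    # In case of a tie either value is acceptable, provided all processors agree.
    return (bit == 1 and ones >= zeros) or (bit == 0 and zeros >= ones)


print(in_election_relation("***", (0, 1, 0)), in_election_relation("***", (1, 1, 0)))
print(in_majority_relation((1, 0, 1), (1, 1, 1)), in_majority_relation((1, 0), (0, 0)))
```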

Let $\mathcal{C}$ be a class of networks, $R$ be a relation and $P$ a protocol. We say that $P$
computes (in a synchronous, interleaved, asynchronous manner resp.) $R$ on $\mathcal{C}$ if for
all $G \in \mathcal{C}$ with $n$ processors it happens that for all $\mathbf{v} \in \mathrm{dom}(R_G)$, every maximal
(synchronous, interleaved, asynchronous resp.) computation of $P$ on $G$ with input $\mathbf{v}$ is
terminating, and its output $\boldsymbol{\omega}$ is such that $\mathbf{v} \mathrel{R_G} \boldsymbol{\omega}$.

³ The input and output functions will be silently extended componentwise to vectors.
3.3 The View Construction

A classical tool in the study of anonymous networks is the concept of view, introduced
for bidirectional networks in [YK96] and extended to the directed case in [BCG+96] (to
tell the truth, the concept of view is already present in the seminal paper [Ang80], partially
disguised under the mathematical notion of graph covering). The view of a processor
is a tree that gathers all the topological information that the processor can obtain by
exchanging information with its neighbours.

More formally, the view of a processor $i$ in the network $G$ is an in-tree $\widetilde{G}^i$ built as
follows:

- the nodes of $\widetilde{G}^i$ are the (finite) paths of $G$ ending in $i$, the root of $\widetilde{G}^i$ being the empty
  path;

- there is an arc from the node $\pi$ to the node $\pi'$ if $\pi$ is obtained by adding an arc $a$ at
  the beginning of $\pi'$.

The tree $\widetilde{G}^i$ is always infinite if $G$ is strongly connected and has at least one arc, and
there is a trivial anonymous protocol that allows each processor to compute its own view
truncated at any desired depth. At step $k + 1$ of the protocol, each processor gathers from
its in-neighbours their views truncated at depth $k$, and combining them it can compute
its own view truncated at depth $k + 1$. The start state is the one-node tree. An example
of view construction is given in Figure 2, where we show the first four steps of view
construction for the processors of a simple network.
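The iterative construction just described can be sketched in a few lines of Python. The representation is an assumption of this sketch: a network is a list of arcs (source, target) with a single arc colour, a truncated view is a nested tuple (input value, sorted tuple of the in-neighbours' views), and sorting the children gives a canonical form so that equal views compare equal.

```python
def views(arcs, inputs, depth):
    """Return {processor: canonical view truncated at `depth`}."""
    current = {i: (inputs[i], ()) for i in inputs}        # depth 0: one-node tree
    for _ in range(depth):
        # Each processor combines the previous views of its in-neighbours,
        # one child per incoming arc (so parallel arcs are counted).
        current = {
            i: (inputs[i], tuple(sorted(current[s] for (s, t) in arcs if t == i)))
            for i in inputs
        }
    return current


ring = [(1, 2), (2, 3), (3, 1)]                           # a directed ring
same = views(ring, {1: 0, 2: 0, 3: 0}, depth=3)
mixed = views(ring, {1: 1, 2: 0, 3: 0}, depth=2)
print(len(set(same.values())), len(set(mixed.values())))
```

With equal inputs all processors of the ring have the same view at every depth, whereas a single differing input already separates all three views.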




Fig. 2. An example of view construction 



Clearly, in the view construction process, inputs must be taken into account. That is,
the views are really coloured on their nodes by elements of the input set. It is possible
that two processors have the same view with a certain choice of inputs, but have a different
view with another. For instance, if all processors have different inputs, then all processors
get different views, even in the case of a very symmetric topology (e.g., a ring).
The reason why views are important is that the state of a processor after k steps of
any anonymous computation may only depend on its view truncated at depth k. Thus, it 
is of crucial importance to determine which processors of a network possess the same 
(infinite) view. 



3.4 Graph Fibrations and the Lifting Lemma 

The problems stated at the end of the last section can be solved very elegantly using an
elementary graph-theoretical concept, that of graph fibration [BV]. A fibration
formalizes the idea that processors that are connected to processors behaving in the same way
will behave alike; it generalizes both the usage of graph coverings in Angluin’s original
paper [Ang80] and the concept of similarity of processors introduced in [JS85].

Recall that a graph morphism f : G → H is given by a pair of functions f_N : N_G → 
N_H and f_A : A_G → A_H that commute with the source and target functions, that is, 
s_H ∘ f_A = f_N ∘ s_G and t_H ∘ f_A = f_N ∘ t_G. (The subscripts will usually be dropped.) In 
other words, a morphism maps nodes to nodes and arcs to arcs in such a way as to preserve 
the incidence relation. If colours are present (on the arcs or on the nodes) they must be 
preserved. 

Definition 1 A fibration between (coloured) graphs G and B is a morphism φ : G → B 
such that for each arc a ∈ A_B and for each node i ∈ N_G satisfying φ(i) = t(a) there 
is a unique arc ã^i ∈ A_G (called the lifting of a at i) such that φ(ã^i) = a and t(ã^i) = i. 

If φ : G → B is a fibration, B is called the base of the fibration. We shall also say that 
G is fibred (over B). The fibre over a node i ∈ N_B is the set of nodes of G that are 
mapped to i, and will be denoted by φ^{-1}(i). 

There is a very intuitive characterization of fibrations based on the concept of local 
isomorphism. A fibration φ : G → B induces an equivalence relation between the nodes 
of G, whose classes are precisely the fibres of φ. When two nodes i and j are equivalent 
(i.e., they are in the same fibre), there is a bijective correspondence between arcs coming 
into i and arcs coming into j such that the sources of any two related arcs are equivalent. 
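
The unique-lifting condition of Definition 1 is easy to check mechanically on small examples. The following Python sketch is an illustration only: graphs are lists of (source, target) arcs, colours are ignored, and the function name and the encoding of the morphism (node_map, arc_map) are our own hypothetical choices.

```python
def is_fibration(g_arcs, b_arcs, node_map, arc_map):
    """Check Definition 1 for multigraphs given as lists of (source, target).

    node_map[i] is the image of node i of G; arc_map[k] is the index in
    b_arcs of the image of the k-th arc of G. Colours are not modelled.
    """
    # The pair of maps must be a morphism: it commutes with source and target.
    for k, (s, t) in enumerate(g_arcs):
        bs, bt = b_arcs[arc_map[k]]
        if node_map[s] != bs or node_map[t] != bt:
            return False
    # Unique lifting: every arc of B entering the image of i has exactly
    # one preimage among the arcs of G entering i.
    for i, image in node_map.items():
        for a, (_, bt) in enumerate(b_arcs):
            if bt != image:
                continue
            lifts = [k for k, (_, t) in enumerate(g_arcs)
                     if t == i and arc_map[k] == a]
            if len(lifts) != 1:
                return False
    return True


loop = [(0, 0)]                       # one node with a single loop
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_ok = is_fibration(cycle, loop, {i: 0 for i in range(4)}, [0, 0, 0, 0])
path_ok = is_fibration([(0, 1)], loop, {0: 0, 1: 0}, [0])
```

The directed 4-cycle is fibred over the one-node loop, whereas the single arc 0 → 1 is not, because node 0 has no incoming arc lifting the loop.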

In Figure 3 we sketched a fibration between two graphs. Note that, because of the 
lifting property described in Definition 1, all black nodes have exactly two incoming 
arcs, one (the dotted arc) going out of a white node, and one (the continuous arc) going 
out of a grey node. In other words, the in-neighbour structure of all black nodes is the 
same. 

The main raison d’être of fibrations is that they allow us to relate the behaviour of the 
same protocol on two networks. To make this claim precise, we need some notation: if 
φ : G → B is a fibration, and x is a global state of B, we can obtain a global state 
x^φ of G by “lifting” the global state of B along each fibre, that is, (x^φ)_i = x_{φ(i)} (see 
Figure 4). Essentially, starting from a global state of B we obtain a global state of G by 
copying the state of a processor of B fibrewise. 

Lemma 1 (Lifting Lemma [BV97a]). Let φ : G → B be a fibration. Then, for every 
protocol P and every synchronous computation x^0, x^1, . . . of P on B, (x^0)^φ, (x^1)^φ, . . . 
is a synchronous computation of P on G. 
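
The statement can be observed experimentally on a toy instance. The sketch below is not the protocol model of the paper: we assume, for illustration, that a protocol is a function of the processor's own state and of the multiset of its in-neighbours' states, and the names step, lift and update are hypothetical.

```python
def step(nodes, arcs, state, update):
    """One synchronous round: every node reads the states of its in-neighbours."""
    return {
        v: update(state[v], sorted(state[s] for (s, t) in arcs if t == v))
        for v in nodes
    }


def lift(state_b, node_map):
    """Copy the state of each base node along its fibre."""
    return {i: state_b[node_map[i]] for i in node_map}


# Hypothetical example: a directed 4-cycle G fibred over a directed 2-cycle B.
g_nodes, g_arcs = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)]
b_nodes, b_arcs = ["a", "b"], [("a", "b"), ("b", "a")]
node_map = {0: "a", 1: "b", 2: "a", 3: "b"}
update = lambda own, received: (2 * own + sum(received)) % 7  # arbitrary rule

x_b = {"a": 0, "b": 1}
x_g = lift(x_b, node_map)
for _ in range(5):
    x_b = step(b_nodes, b_arcs, x_b, update)
    x_g = step(g_nodes, g_arcs, x_g, update)
    assert x_g == lift(x_b, node_map)  # the run on G is the lifted run on B
```

At every round the global state of G coincides with the lifting of the global state of B, so processors in the same fibre are never distinguished.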

These considerations may seem trivial when applied to a single network, but things 
become trickier when a whole class is involved. Indeed, if two different networks G and 
H of the class are fibred over a common base B, then the outputs of the two networks 
with respect to inputs that are liftings of the same input of B are inextricably related. 

Indeed, these observations are already sufficient to show that majority is not computable 
in the class of Figure 1. In Figure 5 we show that two networks G and H of 
our class are fibred over a common base B. Suppose by contradiction that we have a 
protocol computing the majority. If we run this protocol on B with, say, input 0 to a and 
1 to b, it will terminate with output ω (0 or 1 are both correct; note that it must terminate, 
because it terminates on G and H). However, if we lift this input to G and H, ω will not 
be a correct output for one of them: a contradiction. 




Fig. 5. Two networks with a common base. 



4 Characterizing Solvability for Arbitrary Classes 

Using fibrations we have proved an impossibility result for our example. However, there 
is a missing link: processors can build views, but are constrained in their behaviour by 
the bases of the network they live in. The connection between these two concepts lies in 
the notion of minimum base. 

4.1 Minimum Bases 

We say that a graph G is fibration prime if every fibration from G is an isomorphism, 
that is, G cannot be collapsed onto a smaller network by a fibration. 

Theorem 1 ([BV]) The following properties hold: 

1. For every graph G there is (up to isomorphism) exactly one fibration-prime graph 
Ĝ such that G is fibred onto (i.e., surjectively) Ĝ. 

2. If G and H have a common base B, then Ĝ = Ĥ. 

3. Let φ : G → Ĝ; then G^i = G^j iff φ(i) = φ(j) (as a consequence, distinct nodes of 
a fibration-prime graph have different views). 

4. A fibration-prime graph is uniquely characterized by the set of views of its nodes. 

There are many ways to build Ĝ. One is a partition set algorithm [BV], similar to the 
one used for finite-state automaton minimization. On the other hand, one can also take 
as nodes of Ĝ the distinct views of G, and put an arc between two views if one view is 
a first-level child of the second one. 
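
A partition-based construction in this spirit can be sketched as follows. We do not claim this is the algorithm of the cited reference: it is a plain iterative refinement, written under the assumptions that arcs are uncoloured, that inputs (if any) are mutually comparable values, and with a hypothetical function name. It returns the fibres of a minimal fibration, that is, the nodes of Ĝ.

```python
def minimal_fibres(nodes, arcs, inputs=None):
    """Partition the nodes into the fibres of a minimal fibration.

    Returns {node: class id}. Two nodes share a class iff they have the
    same input and, recursively, the same multiset of in-neighbour classes.
    """
    colour = {v: (inputs[v] if inputs else 0) for v in nodes}
    while True:
        # Signature: own class plus the multiset of in-neighbours' classes.
        sig = {
            v: (colour[v], tuple(sorted(colour[s] for (s, t) in arcs if t == v)))
            for v in nodes
        }
        ids = {s: n for n, s in enumerate(sorted(set(sig.values())))}
        refined = {v: ids[sig[v]] for v in nodes}
        # Each round refines the partition; stop when no class was split.
        if len(set(refined.values())) == len(set(colour.values())):
            return refined
        colour = refined


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
uniform = minimal_fibres(range(4), cycle)
alternating = minimal_fibres(range(4), cycle, inputs=[0, 1, 0, 1])
one_marked = minimal_fibres(range(4), cycle, inputs=[0, 0, 0, 1])
```

For a directed 4-cycle the number of fibres is 1 with equal inputs, 2 with alternating inputs, and 4 when a single processor carries a distinguished input.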

The fibrations from G to Ĝ are called minimal. There is usually more than one 
minimal fibration, but they must all coincide on the nodes by property (3) of Theorem 1, 
so we denote with μ_G one of them when the map on the arcs is not relevant. 

In Figure 6 we show a number of networks and the corresponding minimum bases. 
Again, the node component of a minimal fibration is represented by suitably labelling 
the nodes. Another example of minimum base was given in Figure 5: the graph B was 
the minimum base of both G and H, that is, Ĝ = Ĥ = B. 




Fig. 6. Some examples of minimal fibrations and minimum bases. 



The previous theorem highlights the deep link between fibrations and views: two 
processors have the same view if and only if they lie in the same fibre of μ_G. Since we 
know by the Lifting Lemma that processors in the same fibre of a fibration cannot behave 
differently, this means that on the one hand, processors with the same view will always 
be in the same state, and on the other hand, if we can deduce Ĝ from a view we can use 
the Lifting Lemma to characterize the possible behaviours. 

The fundamental fact we shall use intensively in all proofs is that the above consid- 
erations, which involve infinite objects (isomorphism of infinite trees, and so on), can 
be described by means of finite entities using the following theorem: 



Theorem 2 Let G be a strongly connected graph and B a fibration-prime graph 
with minimum number of nodes such that, for some node i of G and node j of B, G^i and 
B^j are isomorphic up to height n_G + (diameter of G), where n_G is the number of nodes 
of G: then B = Ĝ, and i is mapped to j by all minimal fibrations. 

How can a processor use the (apparently unfathomable) previous theorem to build 
the minimum base? First of all, in 2N − 1 rounds all processors build their views up 
to height 2N − 1, where N is a known bound on the number of processors (and thus 
N − 1 bounds the diameter). Then, they perform locally an exhaustive search for the 
only graph B satisfying the theorem. 

Note that since our processors have additional information given by their inputs, the 
constructions described here must be performed taking the inputs into account. More 
precisely, views and minimum bases must be constructed “as if” the processors were 
(additionally) coloured with their inputs. For instance, in Figure 7 we show the minimum 
base for a 4-cycle with respect to different inputs. 




Fig. 7. Minimal fibrations of the same graph with respect to different inputs. 



4.2 The (A)Synchronous Case 

We are finally ready to characterize the classes of networks on which (a)synchronous 
computation of a relation R is possible. First of all we notice that, from a computability 
viewpoint, there is no difference between the synchronous and the asynchronous case, 
as shown in [BV99]; thus, we restrict our proofs to the synchronous case. 

Theorem 3 Let C be a class of networks of bounded size. Then R is (a)synchronously 
computable on C iff for all graphs B and all v ∈ Υ^{N_B} there is an ω ∈ Ω^{N_B} such that 
for all G ∈ C and all fibrations φ : G → B, if v^φ ∈ dom(R_G) then v^φ R_G ω^φ. 

Proof. If the conditions of the statement are not satisfied, then there is a graph B and a 
v ∈ Υ^{N_B} such that for all ω ∈ Ω^{N_B} there is a G ∈ C and a fibration φ : G → B such 
that v^φ ∈ dom(R_G) but not v^φ R_G ω^φ. By the Lifting Lemma this implies that, 
whichever protocol we use, at least one computation on some graph G in C starting from 
in(v^φ) will give an output that is incorrect. 

Otherwise, each processor determines its minimum base taking into account the inputs, 
and, using the conditions above, chooses an output value that satisfies R_G (i.e., it 
looks for the correct output on the minimum base and chooses its own output accordingly). 

Note on the exhaustive search for the minimum base: there are of course “smarter” ways 
to find Ĝ for specific types of networks; nonetheless, for the general case Theorem 2 
gives an optimal bound [BV]. 

Note on the bounded size required in Theorem 3: by this we mean that there is an integer N 
such that all networks in C have at most N nodes. 

The characterization implied by the statement of the previous theorem may seem 
daunting, but it really boils down to the following simple procedure: 

1. Compute the minimum bases of the networks in C with respect to every possible 
input assignment. 

2. For each base, choose an output ω satisfying the conditions of the theorem. 

3. If the previous step is not possible, then the relation cannot be computed on C; 
otherwise, the chosen outputs can be put into a table and used by the processors, 
after the minimum base construction, to choose their output. 

All the above steps are finite if C and R are finite. Thus, the characterization is recursive. 

We can now apply the characterization to our original problem. First of all, we have 
to throw away one of the last two networks: as we already argued (and as easily shown 
using the above theorem) there is no majority algorithm for the class if both networks 
of Figure 5 are present. Hence, we restrict to the first four networks of Figure 1. 

We note that the first three networks in Figure 1 have a different minimum base than 
the others, as shown in Figure 8; thus, we can consider them separately. 




Fig. 8. The remaining networks and their minimum base. 



Now, a trivial application of the procedure we outlined shows that the conditions of 
Theorem 3 are satisfied: the only subtle point is that if we assign different inputs to the 
two nodes of the minimum base, there is always a choice for the output that works when 
lifted to the three networks. 

A similar investigation shows, for instance, that election is not solvable in the restricted 
class of Figure 8: no matter which processor we decide to elect on the base, 
there will be two processors elected in the second network. If the latter is eliminated, 
however, election becomes possible. 

4.3 The Interleaved Case 

We sketch the ideas behind the characterization of the interleaved case. We say that a 
fibration is acyclic if the subgraphs induced by each fibre are acyclic (except for loops). 
Acyclicity guarantees that there is always an interleaved activation working fibrewise, 
so that the Lifting Lemma can be extended to this case (as done in [BV97a]). More 
precisely, for each interleaved activation on the base B of a fibration φ : G → B one 
can find an interleaved activation of G such that processors are activated fibre by fibre. 
The activation in each fibre starts from a sink in the fibre and continues inductively so 
that no processor in the fibre is activated before one of its successors in the fibre. 

On the positive side, processors use a simple 0-1 game (as shown in IRCG+Qhll 
for election) to break symmetry until they obtain a minimal acyclic fibration (that is, a 
fibration that collapses a network as much as possible without violating acyclicity). The 
theory in this case is a bit different, as there are many possible, nonisomorphic, minimal 
bases. Nonetheless, the statement of the characterization remains the same: 

Theorem 4 Let C be a class of networks of bounded size. Then R is computable with 
interleaved activation on C iff for all graphs B and all v ∈ Υ^{N_B} there is an ω ∈ Ω^{N_B} 
such that for all G ∈ C and all acyclic fibrations φ : G → B, if v^φ ∈ dom(R_G) then 
v^φ R_G ω^φ. 

Proof. If the conditions of the statement are not satisfied, then there is a graph B and a 
v ∈ Υ^{N_B} such that for all ω ∈ Ω^{N_B} there is a G ∈ C and an acyclic fibration φ : G → B 
such that v^φ ∈ dom(R_G) but not v^φ R_G ω^φ. By the Lifting Lemma (extended 
to acyclic fibrations and interleaved activations) this implies that, whichever protocol we 
use, at least one computation on some graph G starting from in(v^φ) will give an output 
that is incorrect. 

With respect to the proof of Theorem 3, we have to show that we can break symmetry 
in such a way as to obtain a fibration φ : G → B which is acyclic. To this purpose, the 
processors first compute a standard minimal fibration (taking inputs into account); then, 
processors in the same fibre label themselves either 0 or 1 according to the following 
rule: 

- if the processor is activated and all in-neighbours in its fibre are unlabelled, it labels 
itself 0; 

- if the processor is activated and any in-neighbour in its fibre is labelled, it labels 
itself 1. 

At the end of this “0-1 game”, the processors compute again the fibres of a minimal 
fibration, this time taking into account the 0-1 identifier they possess. This process can 
be repeated, and can only terminate when the minimal fibration is acyclic; this happens 
because at each step every nontrivial fibre containing an oriented cycle breaks at least 
into two pieces (in such a fibre, the first activated processor labels itself 0, and among the 
processors belonging to an oriented cycle the last activated processor necessarily labels 
itself 1). Now all processors can select an output using the conditions in the statement. 

If we get back to our guiding example, we can note that all networks in Figure 1 are 
minimal with respect to acyclic fibrations, except for the third, which is acyclically 
fibred over the first one. Again, a simple case-by-case examination shows that now both 
majority and election are possible, due to the great desymmetrizing power of interleaved 
activation. 

5 Some Easy Applications 

We give a few theorems outlining more general applications of our results. To our knowledge, 
these are the first results for these problems (the results apply also to the interleaved 
model). 

Theorem 5 Let C be the class of all bidirectional networks of size n (that is, every arc 
has an associated arc in the opposite direction). Then majority is computable on C. 

Proof (Sketch) The linear constraints imposed by bidirectionality determine uniquely 
the cardinality of the fibres of a fibration once its base is known iPO). Thus, the output 
vector ω in the statement of Theorem 3 can be chosen by counting the actual multiplicity 
of each input. 

We can give a more elementary point of view on the previous theorem: by solving a 
linear system, each processor can determine the cardinality of all fibres of the minimal 
fibration. Then, it can compute locally the multiplicity of each input, and thus derive the 
correct output. 
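
One way to make the linear system concrete is the following sketch; the derivation is ours and is offered only as an illustration. Write F_x for the fibre over a base node x and m(x, y) for the number of arcs from x to y in the base. By the lifting property the arcs of G going from F_x to F_y number |F_y| · m(x, y); by bidirectionality they are as many as the arcs going from F_y to F_x, which number |F_x| · m(y, x). Together with the sum of all |F_x| being n, these equations fix every cardinality when the base is connected. The function name fibre_sizes is hypothetical.

```python
from fractions import Fraction


def fibre_sizes(base_nodes, base_arcs, n):
    """Fibre cardinalities over a connected base of a bidirectional network.

    base_arcs is a list of (source, target) pairs (a multigraph); n is the
    number of nodes of the network fibred over the base.
    """
    def mult(x, y):
        return sum(1 for arc in base_arcs if arc == (x, y))

    start = base_nodes[0]
    ratio = {start: Fraction(1)}   # fibre size relative to the start fibre
    stack = [start]
    while stack:
        x = stack.pop()
        for y in base_nodes:
            if y not in ratio and mult(x, y) > 0:
                # |F_y| * m(x, y) == |F_x| * m(y, x)
                ratio[y] = ratio[x] * mult(y, x) / mult(x, y)
                stack.append(y)
    scale = Fraction(n) / sum(ratio.values())
    return {x: int(ratio[x] * scale) for x in base_nodes}


# Bidirectional star with three leaves: base nodes C (centre) and L (leaf).
star_base = [("C", "L"), ("L", "C"), ("L", "C"), ("L", "C")]
sizes = fibre_sizes(["C", "L"], star_base, 4)
```

For the star, the sketch recovers one centre and three leaves; multiplying each base input by the size of its fibre then gives the multiplicity of each input in the network.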

Note that the above theorem is independent of the colouring on the arcs, and thus 
of the communication model. For instance, it works in broadcast networks without 
distinguished incoming links. 



Theorem 6 Let C be a class of networks with distinguished outgoing links (that is, every 
arc has a colour and arcs going out of the same processor get different colours) and 
bounded size. Then majority is computable on C. 

Proof. Since the outgoing links have different colours, all fibrations starting from a 
network in C have the property that all fibres have the same cardinality (this happens 
because they are really coverings; see [BV]). Thus, lifting multiplies each input the 
same number of times; hence, the output vector ω can be chosen by taking a majority 
vote on v. 

The theorem above would not hold if we required distinguished incoming links. A 
counterexample can be built by colouring all arcs of the network B of Figure 5 with 
different colours, and lifting the colours on the networks G and H. 

It is interesting to contrast the behaviour of the counting problem in the above two 
cases: each processor must compute how many processors with its own input are present. 



Theorem 7 Let C be the class of all bidirectional networks of size n (that is, every arc 
has an associated arc in the opposite direction). Then counting is computable on C. 



Theorem 8 Let C be a class of networks with distinguished outgoing links (that is, every 
arc has a colour and arcs going out of the same processor get different colours) and size 
n. Then counting is computable on C. 

The theorem above does not hold if just a bound on the size is known, and, of course, 
if we required distinguished incoming links. The proof of the last statements should 
now be an easy exercise for the reader. 

5.1 Conclusions 

We have only scratched the surface of the range of applications of the general theory of 
anonymous computation sketched in this paper. For instance, the theory also allows one to 
establish the possibility of weak computability. A relation is weakly computable on a 
class C if there is a protocol that works correctly on all networks of the class on which 
the relation is computable, but stops all processors in a special “impossibility state” in 
the remaining networks. In other words, the processors either compute the result, or 
establish in a distributed and anonymous way that this is impossible on the specific network 
they live in (previous literature often confused the strong and weak version of a problem). 
Of course, in general the classes on which a relation is weakly computable are much 
larger than in the standard case. It is not difficult to derive from Theorem 3 a necessary 
and sufficient condition for weak computability. 

As we noticed elsewhere, the bound given on the number of steps required to compute 
a relation (the number of nodes plus the diameter) is optimal, as there are classes of graphs 
in which topology reconstruction cannot be performed in less than that number of steps. 

Abstract. The Farsite distributed file system stores multiple replicas 
of files on multiple machines, to provide file access even when some ma- 
chines are unavailable. Farsite assigns file replicas to machines so as to 
maximally exploit the different degrees of availability of different ma- 
chines, given an allowable replication factor R. We use competitive anal- 
ysis and simulation to study the performance of three candidate hill- 
climbing replica placement strategies, MinMax, MinRand, and RandRand, 
each of which successively exchanges the locations of two file replicas. We 
show that the MinRand and RandRand strategies are perfectly competitive 
for R = 2 and 2/3-competitive for R = 3. For general R, MinRand 
is at least 1/2-competitive and RandRand is at least 10/17-competitive. 
The MinMax strategy is not competitive. Simulation results show better 
performance than the theoretic worst-case bounds. 



1 Introduction 

This paper analyzes algorithms for automated placement of file replicas in the 
Farsite 0 system, using both theory and simulation. In the Farsite distributed 
file system, multiple replicas of files are stored on multiple machines, so that 
files can be accessed even if some of the machines are down or inaccessible. The 
purpose of the placement algorithm is to determine an assignment of file replicas 
to machines that maximally exploits the availability provided by machines. 

The file placement algorithm is given a fixed value, R, for the number of 
replicas of each file. For systems reasons, we are most interested in a value of 
R = 3 PI. However, to ensure that our results are not excessively sensitive to the 
file replication factor, we also provide tight bounds for R = 2 and lower bounds 
for all R (tight at different values of R). 

Our theoretic investigations cover an arbitrary distribution of machine avail- 
abilities and show worst-case behavior for a slightly abstracted model of the 
problem. For these studies, we assume an adversary that can establish - and con- 
tinuously change - the availability characteristics for all machines, and we assess 
the ability of our algorithms to maximize the minimum file availability relative 
to the optimally achievable minimum file availability, for any given assignment 
of machine availability values. We do not attempt to classify the computational 
complexity of the problem, because it is not a classic input-output algorithm. 

J. Welch (Ed.): DISC 2001, LNCS 2180, pp. 48-^ 2001. 

Our simulations are driven by actual measurements of machine availability 
0 and show average-case behavior for a specific set of real-world measurements. 
For these studies, we consider not only the minimum file availability but also the 
distribution of file availability values. In all cases, we use a logarithmic measure 
for machine and file availability values, in part because of its standard usage m 
and computational convenience, but also because a linear measure understates 
the differences between results, since for all algorithms the minimum-availability 
file is available for a fraction of time that is very close to unity. 

The next section describes the Farsite system and provides some motivation 
for why file replica placement is an important problem. Section 3 describes the 
algorithms, followed by a summary of results in Section 4. Section 5 presents a 
simplified theoretic model of the Farsite system environment, which is used in 
Section 6 to analyze the performance of the algorithms. Section 7 describes the 
environment for our simulations, the results of which are detailed in Section 8. 
Related work is discussed in Section 9. 



2 Background 

Farsite 0 is a secure, highly scalable, serverless, distributed file system that 
logically functions as a centralized file server without requiring any physical cen- 
tralization whatsoever. The system’s computation, communication, and storage 
are distributed among all of the client computers that participate in the system. 
Farsite runs on a networked collection of ordinary desktop computers in a large 
corporation or university, without interfering with users’ local tasks, and without 
requiring users to modify their behavior in any way. As such, it needs to pro- 
vide a high degree of security and fault tolerance without benefit of the physical 
protection and continuous support enjoyed by centralized server machines. 

There are four properties that Farsite provides for the files that it stores: pri- 
vacy, integrity, reliability, and availability. Data privacy is afforded by symmetric- 
key and public-key encryption, and data integrity is afforded by one-way hash 
functions and digital signatures. Reliability, in the sense of data persistence, is 
provided by making multiple replicas of each file and storing the replicas on 
different machines. The topic of the present paper is file availability, in the sense 
of a user’s being able to access a file at the time it is requested. 

Like reliability, file availability is provided by storing multiple replicas of 
each file on different machines. However, whereas the probability of permanent 
data-loss failure is assumed to be identical for all machines, the probability of 
transitory unavailability (such as a machine’s being powered off temporarily) is 
demonstrably not identical for all machines. A five-week series of hourly mea- 
surements of more than 50,000 desktop machines at Microsoft 0 has shown 
that (1) machine availabilities vary dramatically from machine to machine, (2) 
the measured availability of each machine is reasonably consistent from week 
to week, and (3) the times at which different machines are unavailable are not 
significantly correlated with each other. 

A file is not available if all the machines that store the replicas of the file 
are temporarily down. Given uncorrelated machine downtimes, the fraction of 
time that a file is unavailable is equal to the product of the fractional downtimes 
of the machines that store replicas of that file. We express availability as the 
negative logarithm of fractional downtime, and then the availability of a file is 
equal to the sum of the availabilities of the machines that store the file’s replicas. 
The goal of a file placement algorithm is to produce an assignment of file replicas 
to machines that maximizes the minimum file availability over all files without 
exceeding the available space on any machine. 
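
As a small numerical illustration of this logarithmic measure, the following Python sketch converts downtimes into availabilities. The text does not fix the base of the logarithm; base 10 is our assumption here (so an availability of 2 means "two nines"), and the function names are hypothetical.

```python
import math


def machine_availability(downtime_fraction):
    """Negative logarithm of the fractional downtime (base 10 is assumed)."""
    return -math.log10(downtime_fraction)


def file_availability(downtime_fractions):
    """Sum of the availabilities of the machines storing the file's replicas."""
    return sum(machine_availability(d) for d in downtime_fractions)


# Three replicas on machines that are down 10%, 1% and 0.1% of the time.
downtimes = [0.1, 0.01, 0.001]
availability = file_availability(downtimes)
# With uncorrelated downtimes the file is unavailable for the product of the
# fractional downtimes, which is 10 ** (-availability).
unavailable_fraction = 10 ** (-availability)
```

Adding availabilities is the same as multiplying downtimes, which is why the placement problem can be stated in terms of sums.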

Measurements of over 10,000 file systems on desktop computers at Microsoft 
0 indicate that machines experience permanent data-loss failures (e.g. disk head 
crashes) in a temporally uncorrelated fashion. We do not allow the algorithm 
to vary the number of replicas on a per-file basis, because this would introduce 
variance into the distribution of file reliability. The measurements show that a 
value of i? = 3 is achievable in a real-world setting jSj . 

3 Algorithms 

To be suitable for a distributed file system, a replica placement algorithm must 
satisfy two essential requirements: It must be incremental and distributable. 
Because the system environment is constantly changing, the algorithm must 
be able to improve an existing placement iteratively, rather than requiring a 
complete re-allocation of storage resources when a file is created or deleted, 
when a machine arrives or departs, or when a machine’s availability changes. 
Because the file system is distributed, the algorithm must scale with the size of 
the system and must operate by making small changes of strictly local scope. 
To satisfy these requirements, we concentrate on hill-climbing algorithms, in 
particular those that perform an ordered succession of swap operations, in which 
the machine locations of two file replicas are exchanged. 

Specifically, we investigate the properties of three algorithms: (1) MinMax, 
in which the only allowed replica-location swaps are between the file with the 
minimum availability and the file with the maximum availability, (2) MinRand, 
in which swaps are allowed only between the file with the minimum availability 
and any other file, and (3) RandRand, in which swaps are allowed between any 
pair of files. In general, file replicas are swapped between machines only if the 
swap reduces the absolute difference between the availabilities of the files and 
only if there is sufficient free space on each machine to accept the replicas that 
are being relocated. If there is more than one successful swap for two given files, 
our algorithm chooses one with minimum absolute difference between the file 
availabilities after the swap. However, for our theoretical worst-case analysis, 
this does not matter. 
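
The swap rule can be sketched as follows. This is a simplified illustration rather than the Farsite implementation: we assume every machine stores exactly one replica (so the free-space check is vacuous), represent a file as the list of availabilities of its machines, and let MinRand scan the other files in order instead of choosing one at random. The names best_swap and min_rand are hypothetical.

```python
def best_swap(f, g):
    """Return (i, j) whose exchange most reduces |sum(f) - sum(g)|, or None."""
    best, best_gap = None, abs(sum(f) - sum(g))
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            # Availabilities of the two files after exchanging machines a and b.
            gap = abs((sum(f) - a + b) - (sum(g) - b + a))
            if gap < best_gap:
                best, best_gap = (i, j), gap
    return best


def min_rand(files):
    """Hill-climb until the minimum-availability file admits no swap."""
    files = [list(f) for f in files]
    while True:
        m = min(range(len(files)), key=lambda k: sum(files[k]))
        for k in range(len(files)):
            swap = best_swap(files[m], files[k]) if k != m else None
            if swap:
                i, j = swap
                files[m][i], files[k][j] = files[k][j], files[m][i]
                break
        else:
            return files  # no successful swap exists: the algorithm freezes


balanced = min_rand([[1, 2], [3, 4]])
frozen = min_rand([[1, 1, 1], [5, 5, 5]])
```

In the first example the two files end with equal availability; in the second the algorithm freezes with availabilities 7 and 11, since no further exchange reduces their difference.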

The intuition behind these algorithms is as follows: RandRand is the most 
general swap-based strategy, in that it allows swaps between any pair of files, 
so it represents a baseline against which to compare and contrast the other 
algorithms. We are most concerned with improving the minimum file availability, 
and since a replica exchange only affects the two files whose replica locations 
are swapped, it makes sense for one of these files to be the one with minimum 
availability, hence MinRand. The motivation behind MinMax is that the maximum 
availability file seems likely to afford the most opportunity for improving the 
minimum availability file without excessively decreasing its own availability. 

In actual practice, the file placement algorithm executes in a distributed 
fashion, wherein the files are partitioned into disjoint sets, and each set is man- 
aged by an autonomous group of a few machines. At each iterative step, one 
of the groups contacts another group (possibly itself), each of the two groups 
selects one of the files it manages, and the groups jointly determine whether 
to exchange machine locations of one of the replicas of each file. Therefore, the 
MinMax and MinRand algorithms are not guaranteed to select files with globally 
extremal availability values. For our theoretic analyses, we concentrate on the 
more restrictive case in which only extremal files are selected. For our simulation 
studies, we model this extremal discrepancy by selecting from a range of files 
with the highest or lowest availability rank. 

4 Summary of Results 

In this paper we perform a worst-case analysis and a simulation to determine 
the efficacy of three hill-climbing algorithms, where the efficacy of an algorithm 
is specified by the availability of a file with minimum availability. We denote 
the efficacy of an algorithm by its competitive ratio ρ = m/m*, where m is the 
efficacy of the hill-climbing algorithm, and m* is the efficacy of an optimal algorithm. 
We show - for both theory and simulation - that the MinRand algorithm 
performs (almost) as well as the RandRand algorithm. The MinMax algorithm 
performs poorly throughout. Here is a detailed summary of our results: 



Algorithm             MinMax             MinRand            RandRand 
Worst-case R = 3      ρ = 0 (Thm.)       ρ = 2/3 (Thm.)     ρ = 2/3 (Thm.) 
Simulated R = 3       ρ ≈ 0.74 (Fig.)    ρ ≈ 0.93 (Fig.)    ρ ≈ 0.91 (Fig.) 
Worst-case R = 2      ρ = 0 (Thm.)       ρ = 1 (Thm.)       ρ = 1 (Thm.) 
Lower bounds any R    ρ = 0 (Thm.)       ρ ≥ 1/2 (Thm.)     ρ ≥ 10/17 (Thm.) 



5 Theoretic Model 

We are given a set of N unit-size files, each of which has R replicas. We are also 
given a set of M = N · R machines, each of which has the capacity to store a 
single file replica. Machines have availabilities a_i > 0, i = 1, . . . , M, given as 
negative logarithms of machines’ downtimes. 

Let the R replicas of file f be stored on machines with availabilities 
a^f_1, . . . , a^f_R. To avoid notational clutter, we overload a variable to name a file 
and to give the availability value of the file. Thus, the availability of file f is 
f = a^f_1 + · · · + a^f_R. 






J.R. Douceur and R.P. Wattenhofer 



Let m be a file with minimum availability when the algorithm has exhausted 
all possible improvements. Let m* be a file with minimum availability given an 
optimal placement for the same values of N, R, and a_i (i = 1, . . . , M). We 
compute the ratio ρ = min m/m* over all allowable a_i as N → ∞. We say that 
the algorithm is ρ-competitive. 

In the practical algorithms, the particle “Rand” stands for a random choice, 
i.e. the MinRand algorithm tries to exchange machines between minimum-availability 
file m and a randomly chosen file f. In the theoretical analysis however, 
“Rand” is treated as “Any”, i.e. the MinRand algorithm tries to exchange machines 
between the minimum-availability file m and any other file f. The algorithm 
stops (“freezes”) only after all legal pairs of files have been tested. We use 
the particle “Rand” rather than “Any” to have a consistent terminology with the 
simulation part of this work. 

If two or more files have minimum availability, or if two or more files have 
maximum availability, we allow an adversary to choose which of the files can be 
swapped. 

The number of possible machine exchanges between two given files grows
exponentially with the number of replicas R. In this paper we do not study this
problem. We are predominantly interested in systems where R is small; for large
R we will give a recipe linear in R that finds machines to be exchanged (see
Lemma 4).

Note that the set of legal pairs of files for the MinRand algorithm is a subset of
the set of legal pairs of files for the RandRand algorithm. That is, if the MinRand
algorithm freezes, it is possible that the RandRand algorithm would still find a
successful exchange. A freeze of the RandRand algorithm, however, also implies
that the MinRand algorithm would not find a successful exchange. Similarly, the
singleton set of legal pairs for MinMax is a subset of the legal pairs for MinRand.
Thus, the efficacy of the RandRand (MinRand) algorithm is at least as high as the
efficacy of the MinRand (MinMax) algorithm. Formally,

Lemma 1. $\rho_{MinMax} \le \rho_{MinRand} \le \rho_{RandRand}$.

With Lemma 1 we are in the position to find the competitive ratio of different
algorithms by simply giving a worst-case example for the stronger algorithm (e.g.,
RandRand), and a qualitative argument for the weaker algorithm (e.g., MinRand).
When discussing the competitive ratio of an algorithm with a specific replication
factor, we append the replication factor R to the algorithm name, i.e., the
competitive ratio of the MinRand algorithm for replication factor 3 is $\rho_{MinRand3}$.

If possible we simplify the arguments by linearly scaling the machine availabilities
such that m = 1 throughout this paper. Note that this does not change
the competitive ratio $\rho$.

6 Competitive Analysis 



We start with R = 3, the case we are most interested in.

Lemma 2. $\rho_{MinRand3} \ge 2/3$.

Proof. The intuition of the proof is as follows: We define a non-decreasing function
g; the argument of g is a non-negative real (a machine availability), and g
returns a real. We define $G := \sum_{a \in A} g(a)$, where the set A is the availabilities
of all machines.

In the first part of the proof we show that the MinRand algorithm freezes
with $G < N$. In the second part of the proof we consider an optimal assignment
of machines to files. If all files in the optimal assignment have availability strictly
greater than 3/2, we show that $G \ge N$. Since $N \le G < N$ is a contradiction, the
minimum file of the optimal assignment has availability $m^* \le 3/2$. With m = 1
the proof will follow.

Here are the details: Let $m = a_1 + a_2 + a_3$ be the file with minimum availability,
with $a_1 \ge a_2 \ge a_3$. Of particular interest is $a_3$, the minimum-availability
machine of minimum file m. Let $g(a)$ be a function applied on the availability a
of a machine, with

$$g(a) = \begin{cases} 1 & \text{if } a > 1 - a_3 \\ 1/2 & \text{if } 1/2 < a \le 1 - a_3 \\ 1/4 & \text{if } a_3 < a \le 1/2 \\ 0 & \text{if } a \le a_3 \end{cases}$$

The function $g(f)$ applied to file f is simply $g(f) = g(b_1) + g(b_2) + g(b_3)$, if
file $f = b_1 + b_2 + b_3$.
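The weight function g can be made concrete with a few lines of Python. This is only an illustrative sketch: the names `g` and `g_file` are ours, the strict and non-strict inequalities follow the case analysis of the proof, and the sample instance is the three-file example m = 1 + 0 + 0, f1 = 1 + 1 + 0, f2 = 1/2 + 1/2 + 1/2 that appears in the proof of Lemma 3 (so $a_3 = 0$).

```python
from fractions import Fraction as F

def g(a, a3):
    """Weight of one machine with availability a, given a3, the smallest machine of the minimum file."""
    if a > 1 - a3:
        return F(1)
    if a > F(1, 2):
        return F(1, 2)
    if a > a3:
        return F(1, 4)
    return F(0)

def g_file(machines, a3):
    """Weight of a file: the sum of the weights of its machines."""
    return sum(g(a, a3) for a in machines)

# Frozen three-file instance; the minimum file is m = 1 + 0 + 0, hence a3 = 0.
files = [
    [F(1), F(0), F(0)],              # m
    [F(1), F(1), F(0)],              # f1
    [F(1, 2), F(1, 2), F(1, 2)],     # f2
]
a3 = min(files[0])
G = sum(g_file(f, a3) for f in files)
N = len(files)
print(G, N)  # G stays strictly below N, as the first part of the proof requires
```

On this instance every file has weight at most 1 and the minimum file has weight at most 3/4, which is exactly the pattern the first part of the proof establishes in general.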

Let $f = b_1 + b_2 + b_3$ be another file with availability $f \ge 1$, and with $b_1 \ge
b_2 \ge b_3$. We assume that there is no successful machine exchange between the
files f and m. We denote the file f (m) after an exchange with f' (m'). Note that
$m' > 1$ and $f' > 1$ would contradict the assumption that the MinRand algorithm
froze, since $f \ge m = 1$.

We distinguish several cases. 

Case 1: Let $b_1 > 1 - a_3$. If $b_2 > a_3$ we can exchange the machines $b_2$ and $a_3$,
such that $f' \ge b_1 + a_3 > (1 - a_3) + a_3 = 1$, and $m' = m + b_2 - a_3 > 1$. Therefore
$b_2 \le a_3$. Thus $g(f) = g(b_1) + g(b_2) + g(b_3) = 1 + 0 + 0 = 1$.

In all other cases we therefore have $b_1 \le 1 - a_3$.

Case 2: If $b_3 \le a_3$, then $g(f) \le 1/2 + 1/2 + 0 = 1$.

In all other cases we have $b_3 > a_3$.

Case 3: Let $b_2 > 1/2$. Since $b_3 > a_3$ we can exchange the machines $b_3$ and $a_3$,
such that $f' \ge b_1 + b_2 > 1$, and $m' = m + b_3 - a_3 > 1$, which is a contradiction
to the assumption that there was no successful exchange. Therefore we have
$b_2 \le 1/2$, and thus $g(f) \le 1/2 + 1/4 + 1/4 = 1$.

So far we have shown that $g(f) \le 1$ for each file. A simple case study reveals
that the minimum file m itself has $g(m) \le 3/4$. Since each machine is part of
exactly one file we have $G = \sum_{f \in F} g(f)$, where F is the set of all files. Since
$|F| = N$ we can conclude

$$G = \sum_{f \in F} g(f) \le (N - 1) \cdot 1 + 3/4 < N.$$

In the second part of the proof we will show that if a file f has a sufficiently
high availability, then $g(f)$ will be at least 1. Specifically, we will show that for
any file $f = b_1 + b_2 + b_3$ (with $b_1 \ge b_2 \ge b_3$) we have

$$f > 3/2 \Rightarrow g(f) \ge 1.$$

We distinguish two cases: 

Case 1: If $b_1 > 1 - a_3$ or $b_2 > 1/2$ then $g(f) \ge 1$, because $g(b_1) \ge 1$ or
$g(b_1) + g(b_2) \ge 1$.

Case 2: We have $b_1 \le 1 - a_3$ and $b_2 \le 1/2$. If $b_1 \le 1/2$, then $f = b_1 + b_2 + b_3 \le
3/2$ (not satisfying the precondition that $f > 3/2$). Thus $1/2 < b_1 \le 1 - a_3$ and
$b_2 \le 1/2$. We have $3/2 < f = b_1 + b_2 + b_3 \le (1 - a_3) + 1/2 + b_3 \Rightarrow b_3 > a_3$. Then

$$g(f) = 1/2 + 1/4 + 1/4 = 1.$$

In the second part of the proof we have shown that files f with availability
$f > 3/2$ necessarily have $g(f) \ge 1$.

The optimal algorithm assigns files to machines such that the file with minimum
availability is $m^*$. Suppose, for the sake of contradiction, that an optimal
algorithm manages to raise the availability of each file f such that $f \ge m^* > 3/2$.
From the second part of the proof we know that in this case $g(f) \ge 1$ for all N
files. Since the function g of a file is defined as a sum of the function g of the
machines of the file, we know that $G \ge N$. With the conclusion of the first part
of the proof we get $N \le G < N$, which is a contradiction. Therefore $m^* \le 3/2$,
and $\rho = m/m^* \ge 2/3$.

Lemma 3. $\rho_{RandRand3} \le 2/3$.

Proof. We give a constructive proof for a worst-case example with the three files
$m$, $f_1$, and $f_2$. The RandRand algorithm freezes with $m = 1 + 0 + 0$ (the minimum-availability
file), $f_1 = 1 + 1 + 0$, and $f_2 = 1/2 + 1/2 + 1/2$, that is, no machine
exchange between any two files decreases the difference of the availabilities of
the two files. We have nine machines with availabilities $3 \times 1$, $3 \times 1/2$, and $3 \times 0$.
An optimal algorithm generates three files $1 + 1/2 + 0 = 3/2$, thus $m^* = 3/2$.
Therefore $\rho_{RandRand3} = m/m^* \le 2/3$.
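The worst-case instance of this proof is small enough to check by brute force. The sketch below assumes, following the wording of the proof, that an exchange swaps equally many machines between two files and counts as successful only if it strictly decreases the difference of the two availabilities; the helper names `can_improve` and `best_minimum` are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def can_improve(x, y):
    """True if swapping equally many machines between files x and y shrinks |sum(x) - sum(y)|."""
    gap = abs(sum(x) - sum(y))
    for k in range(1, len(x) + 1):
        for give in combinations(range(len(x)), k):
            for take in combinations(range(len(y)), k):
                delta = sum(y[j] for j in take) - sum(x[i] for i in give)
                if abs((sum(x) + delta) - (sum(y) - delta)) < gap:
                    return True
    return False

def best_minimum(machines, r):
    """Largest achievable minimum file availability over all splits into files of r machines."""
    if not machines:
        return None  # no file left to build
    first, rest = machines[0], machines[1:]
    best = None
    # The first machine must go somewhere; enumerate its r - 1 companions.
    for idx in combinations(range(len(rest)), r - 1):
        group = first + sum(rest[i] for i in idx)
        remaining = [a for i, a in enumerate(rest) if i not in idx]
        sub = best_minimum(remaining, r)
        value = group if sub is None else min(group, sub)
        if best is None or value > best:
            best = value
    return best

m = [F(1), F(0), F(0)]
f1 = [F(1), F(1), F(0)]
f2 = [F(1, 2), F(1, 2), F(1, 2)]

frozen = not any(can_improve(x, y) for x, y in combinations([m, f1, f2], 2))
m_star = best_minimum(m + f1 + f2, 3)
ratio = sum(m) / m_star
print(frozen, m_star, ratio)
```

The exhaustive search is exponential in the number of machines, so it is only meant for toy instances such as this nine-machine example.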

Theorem 1. $\rho_{MinRand3} = \rho_{RandRand3} = 2/3$.

Proof. The Theorem follows directly from Lemmas 2 and 3 together with Lemma 1.

For replication factor 2 the MinRand algorithm is optimal:

Theorem 2. $\rho_{MinRand2} = \rho_{RandRand2} = 1$.

Proof. The proof is a “light” version of the proof of Lemma 2, and the details
are omitted in this extended abstract. The function g is defined as

$$g(a) = \begin{cases} 1 & \text{if } a > 1 - a_2 \\ 1/2 & \text{if } a_2 < a \le 1 - a_2 \\ 0 & \text{if } a \le a_2 \end{cases}$$

The MinMax algorithm performs poorly in general, as the following example
shows.



Theorem 3. $\rho_{MinMax} = 0$.

Proof. We give a constructive proof for a worst-case example with (at least)
three files: Let $m = 0 + 0 + 0 + \cdots + 0$ be the file with minimum availability (note
that m = 0), and $f = 3 + 0 + 0 + \cdots + 0$ be the file with maximum availability,
and all other files (at least one) have the machines $1 + 1 + 0 + \cdots + 0$. The
MinMax algorithm freezes since there is no successful exchange between the files m and f.
For $N \ge 3$ we have $2(N - 2) + 1 \ge N$ machines with availability at least 1, and
it is possible to reassign the machines to files such that each file has at least one
machine with availability at least 1, that is $m^* \ge 1$. Thus $\rho \le 0/1 = 0$.

We want to be confident that our algorithms do not fail with larger replication
factors. In the following we give bounds on the performance for arbitrary R. All
our bounds are tight for some R: As seen above, the MinMax algorithm is non-competitive
for any R. The MinRand algorithm is worst when $R \to \infty$, and the
RandRand algorithm is worst when R is 7.



Lemma 4. Let $m = a_1 + \ldots + a_R$ be a file with availability m = 1, and $a_1 \ge \ldots \ge
a_R \ge 0$. Let $f = b_1 + \ldots + b_R > 1$ be another file, with $1 \ge b_1 \ge \ldots \ge b_R \ge 0$.
Let $\bar f$ be f, but all the machines with availability less than $a_R$ are replaced with
machines with availability $a_R$, that is, $\bar f = \max(b_1, a_R) + \max(b_2, a_R) + \ldots +
\max(b_R, a_R)$. If $\bar f > 2$ we can successfully exchange machines between m and f.

Proof. Let l be the highest index such that $b_l > a_R$, that is, either $b_{l+1} \le a_R$ or
$l = R$. First we exchange the machines $b_l$ and $a_R$, that is $m' = m + b_l - a_R > 1$
and $f' = f + a_R - b_l$. If $f' > 1$ we are done, because we found a way to exchange
machines and both availabilities are strictly greater than one. In the remainder
of the proof we need to consider the case where $f' \le 1$ only.

As long as $m' > 1$ and $f' \le 1$ we repeatedly exchange the machines $a_{i+1}$ and
$b_{R-i}$, for $i = 0, 1, \ldots$. We denote m' (f') after the ith exchange with $m^i$ ($f^i$).
More formally:

$$m^i = m' + \sum_{j=0}^{i} b_{R-j} - \sum_{j=0}^{i} a_{j+1} \quad\text{and}\quad f^i = f' + \sum_{j=0}^{i} a_{j+1} - \sum_{j=0}^{i} b_{R-j}.$$

In the remainder of the proof we show that these repeated exchanges will eventually
be successful, that is, there is a k (with $k \le R - l - 1$) such that after k
exchanges we have $m^k > m$ and $f^k > m$.

First, we show that the process of repeated exchanges terminates. In other
words, there is a $k \le R - l - 1$ such that either $m^k \le 1$ or $f^k > 1$.
In particular, if $i = R - l - 1$, then

$$\begin{aligned}
f^i &= f + a_R - b_l + \sum_{j=0}^{i} a_{j+1} - \sum_{j=0}^{i} b_{R-j} = \sum_{j=1}^{l-1} b_j + a_R + \sum_{j=0}^{R-l-1} a_{j+1} \\
&\ge \sum_{j=1}^{l-1} \max(b_j, a_R) + a_R + \sum_{j=l+1}^{R} \max(a_R, b_j) \qquad (\text{using } b_l > a_R \le a_j) \\
&= \bar f + a_R - b_l > 2 + a_R - b_l \ge 1. \qquad (\text{using } b_l \le 1)
\end{aligned}$$

We have $f^i > 1$, and therefore the process of repeated exchanges terminates.

Since the process terminates after k exchanges, we have either $m^k \le 1$ or
$f^k > 1$.

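We distinguish three cases.

Case 1: Let $m^k > 1$ and $f^k > 1$. We have found successful machine exchanges
since $m^k > m$ (m = 1) and $f^k > m$. We are done.

Case 2: Let $m^k \le 1$ and $f^k \le 1$. This contradicts the assumptions that m = 1
and $f > 1$ because $2 \ge m^k + f^k = m + f > 2$.

The only remaining case is the most difficult.

Case 3: Let $m^k \le 1$ and $f^k > 1$. Note that before the kth exchange it was
decided to do another exchange, that is, $m^{k-1} > 1$ and $f^{k-1} \le 1$.

The precondition of this Lemma is $\bar f = b_1 + \ldots + b_{l-1} + b_l + (R - l) a_R > 2$.
For readability we split $\bar f$ at $b_l$ into two parts: $\bar f = f_1 + f_2$.
Then

$$\begin{aligned}
f_1 &= b_1 + \ldots + b_{l-1} = f - b_l - \sum_{i=l+1}^{R} b_i = f' - a_R - \sum_{i=l+1}^{R} b_i \\
&= f' - a_R - \sum_{i=0}^{k-1} b_{R-i} - \sum_{i=l+1}^{R-k} b_i \le f' - a_R - \sum_{i=0}^{k-1} b_{R-i} \\
&= f^{k-1} - a_R - \sum_{i=0}^{k-1} a_{i+1} \le 1 - a_R - \sum_{i=0}^{k-1} a_{i+1}.
\end{aligned}$$

Also we have

$$\begin{aligned}
f_2 &= b_l + (R - l) a_R \le b_l + \sum_{i=l+1}^{R} a_i = m + b_l - \sum_{i=1}^{l} a_i \\
&= m^k + \sum_{i=0}^{k} a_{i+1} - \sum_{i=0}^{k} b_{R-i} + a_R - \sum_{i=1}^{l} a_i \le 1 + a_R + \sum_{i=0}^{k} a_{i+1} - \sum_{i=1}^{l} a_i.
\end{aligned}$$

Together we get

$$\bar f = f_1 + f_2 \le 1 - \sum_{i=0}^{k-1} a_{i+1} - a_R + 1 + a_R + \sum_{i=0}^{k} a_{i+1} - \sum_{i=1}^{l} a_i \le 2 + a_{k+1} - a_1 \le 2.$$

This contradicts the precondition $\bar f > 2$.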

Only case 1 did not contradict the assumptions and preconditions. The
Lemma follows immediately.

Lemma 5. $\rho_{MinRand} \ge 1/2$.

Proof. Let $m = a_1 + \ldots + a_R = 1$ be a file with minimum availability, with $a_1 \ge
\ldots \ge a_R \ge 0$. Let $f = b_1 + \ldots + b_R > 1$ be another file, with $b_1 \ge \ldots \ge b_R \ge 0$,
and assume that there is no successful machine exchange.

We distinguish two types of files f:

Type A: Let $b_1 > 1$. If $b_2 > a_R$ we can exchange the machines $b_2$ and $a_R$
such that $m' = m + b_2 - a_R > 1$, $f' \ge b_1 > 1$. Therefore $b_2 \le a_R$.

Type B: $b_1 \le 1$.

Of the N files, one is the minimum file m, x files are of type A, and $N - 1 - x$
files are of type B. We have exactly x machines with availability strictly greater
than 1, that is, an optimal assignment will end up with at least $N - x$ files that
can only use machines with availability 1 or less. Since type A files have $b_2 \le a_R$
we can at most replace the machines of the type B files that have availability
less than $a_R$ with machines that have availability $a_R$. With Lemma 4 we know
that such an improved file of type B can at most have availability 2. An optimal
algorithm can redistribute the files such that the total sum of availabilities is at
most $G = 2(N - x - 1) + 1$. An optimal redistribution to $N - x$ files gives the
new minimum file an availability of at most $m^* \le G/(N - x)$, thus $m^* < 2$.
Therefore $m/m^* > 1/2$, and hence $\rho \ge 1/2$.

Theorem 4. $\lim_{R \to \infty} \rho_{MinRandR} = 1/2$.

Proof. For simplicity let R be a power of 2, that is $R = 2^r$. We construct the
following files:

— The minimum file $m = 1/R + \ldots + 1/R = 1$.

— R/2 files of type 0 with $f = 2 + 1/R + 1/R + \ldots + 1/R$.

— For $i = 1, \ldots, r - 1$: $R/2^{i+1}$ files of type i with $f = 2^i \times 2^{-i} + 0 + \ldots + 0$.

Note that there is no successful exchange between the minimum file m and
any other file. We have used the following machines:

— R/2 machines with availability 2,

— For $i = 1, \ldots, r - 1$: R/2 machines with availability $2^{-i}$,

— $R/2 \cdot (R - 1) + R$ machines with availability 1/R,

— The rest of the machines have availability 0.

With the same machines we can build

— R/2 files of type A with f = 2 + rest, that is $f \ge 2$, and

— R/2 files of type B with $f = 1/2 + 1/4 + 1/8 + \ldots + 4/R + 2/R + 1/R +
1/R + \ldots + 1/R$, that is $f = 1 + (R - r - 1)/R$.

Since $\lim_{R \to \infty} 1 + (R - r - 1)/R = 2$ we have $m^* \to 2^-$, and therefore $\rho =
m/m^* \to 1/2^+$. With Lemma 5, $\rho \to 1/2^+$ is tight and the Theorem follows.
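The closed form for the type B files can be checked numerically. The following sketch rebuilds one type B file from the machine list of the proof (one machine of each availability $2^{-i}$ for i = 1, ..., r − 1, with all remaining slots filled by machines of availability 1/R) and compares its availability with $1 + (R - r - 1)/R$. The function name `type_b_availability` is ours, and the composition of the file is our reading of the construction.

```python
from fractions import Fraction as F

def type_b_availability(r):
    """Availability of one type B file for R = 2**r, built as in the proof of Theorem 4."""
    R = 2 ** r
    halves = [F(1, 2 ** i) for i in range(1, r)]   # one machine each of 1/2, 1/4, ..., 2/R
    fillers = [F(1, R)] * (R - len(halves))         # the remaining slots hold 1/R machines
    return sum(halves) + sum(fillers)

for r in (3, 6, 10):
    R = 2 ** r
    f = type_b_availability(r)
    assert f == 1 + F(R - r - 1, R)
    print(R, f, float(1 / f))   # m / f with m = 1; tends to 1/2 from above as R grows
```

The gap to 2 is (r + 1)/R, so the ratio approaches 1/2 quickly as R grows.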



Lemma 6. $\rho_{RandRand} \ge 10/17$.

Proof. We omit this tedious proof in the extended abstract. Apart from some
complications it is similar to the proof of Lemma 2. The function g is defined as

$$g(a) = \begin{cases} 1 & \text{if } a > 1 \\ 1/2 & \text{if } 2/3 < a \le 1 \\ a/(2 - a) & \text{if } 1/2 < a \le 2/3 \\ a/(1 + a) & \text{if } a \le 1/2 \end{cases}$$

Theorem 5. $\rho_{RandRand} = 10/17$, when we allow R to be a non-integer. If R
must be an integer then $\rho_{RandRand} \le 3/5$.

Proof. First we are going to make a relaxation to our normal model by assuming
that $R = 6 + \epsilon$, where $\epsilon > 0$. (Remark: A non-integer R is an ethereal construct
without physical realization; we need $\epsilon$ to go towards 0 such that we can prove
a tight bound.)

Let $m = 1 + 0 + \ldots$. Additionally we have a files of type $1 + 1 + 0 + \ldots$,
b files of type $1/2 + 1/2 + 1/2 + 0 + \ldots$, and c files with $6 + \epsilon$ machines with
availability $1/(5 + \epsilon)$. Note that $(6 + \epsilon)/(5 + \epsilon) = 1 + 1/(5 + \epsilon)$. The RandRand
algorithm does not find any successful exchange between any pair of files. If we
do the accounting, we find $2a + 1$ machines with availability 1, $3b$ machines with
availability 1/2, $c(6 + \epsilon)$ machines with availability $1/(5 + \epsilon)$, and $(5 + \epsilon) + a(4 +
\epsilon) + b(3 + \epsilon)$ machines with availability 0.

An optimal algorithm assigns the machines so that it will get $1 + a + b + c$
files of type $f = 1 + 1/2 + 1/(5 + \epsilon) + 0 + \ldots$; it will have $1 + a + b + c$ machines
with availability 1, 1/2, or $1/(5 + \epsilon)$ each, and $(1 + a + b + c)(3 + \epsilon)$ machines
with availability 0. This is possible if and only if the following equation system
is solvable:

$$\begin{aligned}
2a + 1 &= 1 + a + b + c && \text{(for availability 1)} \\
3b &= 1 + a + b + c && \text{(for availability 1/2)} \\
c(6 + \epsilon) &= 1 + a + b + c && \text{(for availability } 1/(5 + \epsilon)\text{)} \\
(5 + \epsilon) + a(4 + \epsilon) + b(3 + \epsilon) &= (1 + a + b + c)(3 + \epsilon) && \text{(for availability 0)}
\end{aligned}$$

We can solve this equation system with $a = 9/\epsilon + 1$, $b = 6/\epsilon + 1$, $c = 3/\epsilon$.
Thus we have a worst-case example that is tight with Lemma 6:

$$m^* = f = \lim_{\epsilon \to 0^+} 1 + 1/2 + 1/(5 + \epsilon) = 17/10.$$

If R has to be integer we choose $\epsilon = 1$, and get $m^* = 1 + 1/2 + 1/6 = 5/3$.
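The stated solution of the counting equations can be verified with exact rational arithmetic. The sketch below plugs a = 9/ε + 1, b = 6/ε + 1, c = 3/ε into the four equations and evaluates the resulting optimum $m^*$; the function names are ours, and the check is purely algebraic (it does not require the file counts to be integers).

```python
from fractions import Fraction as F

def check_solution(eps):
    """Plug a = 9/eps + 1, b = 6/eps + 1, c = 3/eps into the four counting equations."""
    a, b, c = 9 / eps + 1, 6 / eps + 1, 3 / eps
    files = 1 + a + b + c
    return (
        2 * a + 1 == files,                                              # availability 1
        3 * b == files,                                                  # availability 1/2
        c * (6 + eps) == files,                                          # availability 1/(5 + eps)
        (5 + eps) + a * (4 + eps) + b * (3 + eps) == files * (3 + eps),  # availability 0
    )

def m_star(eps):
    """Availability of every file in the optimal placement."""
    return 1 + F(1, 2) + 1 / (5 + eps)

for eps in (F(1), F(1, 10), F(3, 7)):
    assert all(check_solution(eps))

print(m_star(F(1)), 1 / m_star(F(1)))   # integer case R = 7
print(float(1 / m_star(F(1, 10**6))))   # approaches 10/17 as eps tends to 0
```

With ε = 1 the instance has a = 10, b = 7, c = 3, that is 21 files of 7 machines each.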

7 Simulated Environment 

We are given a set of M = 51, 662 machines with a measured distribution of avail- 
abilities in the range of 0.0 to 3.0 0, calculated as the negative decimal logarithm 
of the machines’ downtimes. (The common unit for availability is the “nine”; for 
example, a machine with a fractional downtime of 0.01 has $-\log_{10} 0.01 = 2$ nines
of availability, intuitively corresponding to its fractional uptime of 1 − 0.01 = 0.99.)

Fig. 1. File availability distributions for MinMax, MinRand, and RandRand (axis label: file availability in nines)

We are given a set of files whose sizes are governed by a binary lognormal distribution
with m(2) = 12.2 and s(2) = 3.43 [7]. We simulate N = 2,583,100
files, averaging 50 files per machine, which runs at the memory limit of the 512- 
MB computer we use for simulation. We maintain excess storage capacity in the 
system, without which it would not be possible to swap file replicas of different 
sizes. The mean value of this excess capacity is 10% of each machine’s storage 
space, and we limit file sizes to less than this mean value per machine. We fix 
the replication factor i? = 3 jSj- 
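The “nines” unit used above is easy to compute. The sketch below converts a fractional downtime into nines and, following the theoretic model, takes the availability of a file as the sum of the availabilities of the machines holding its replicas. That sum corresponds to multiplying the replicas' downtimes, which presumes independent machine failures; this independence is our assumption for the illustration, and both function names are hypothetical.

```python
import math

def nines(downtime_fraction):
    """Availability in nines: the negative decimal logarithm of the fractional downtime."""
    return -math.log10(downtime_fraction)

def file_nines(replica_downtimes):
    """Availability of a file as the sum of its replicas' machine availabilities."""
    return sum(nines(d) for d in replica_downtimes)

print(nines(0.01))                     # a machine that is down 1% of the time
print(file_nines([0.01, 0.1, 0.001]))  # three replicas on machines of differing quality
```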

At each step, a pair of files is selected randomly (uniform distribution). To 
model the fact that in a distributed environment the selection of files for swap- 
ping is done without global knowledge, we set a selection range for minimum 
and maximum availability files 2%, to be consistent with a mean value of 50 files 
per machine. In other words, the “minimum-availability” file is drawn from the 
set of files with the lowest 2% of availabilities, and the “maximum-availability” 
file is drawn from the set of files with the highest 2% of availabilities, uniformly 
at random. 
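The selection range can be sketched as follows. The helper `pick_extreme` is hypothetical and not taken from the simulator; it sorts the files by availability and draws uniformly from the lowest (or highest) 2% of them, which is how we read the description above.

```python
import random

def pick_extreme(availabilities, fraction=0.02, lowest=True, rng=random):
    """Index of a file drawn uniformly from the lowest (or highest) `fraction` of availabilities."""
    order = sorted(range(len(availabilities)), key=availabilities.__getitem__)
    width = max(1, int(len(order) * fraction))   # size of the selection range
    pool = order[:width] if lowest else order[-width:]
    return rng.choice(pool)

rng = random.Random(1)
avail = [rng.uniform(0.0, 9.0) for _ in range(1000)]
low = pick_extreme(avail, lowest=True, rng=rng)
high = pick_extreme(avail, lowest=False, rng=rng)
print(avail[low], avail[high])
```

A full sort is used here only for brevity; a distributed implementation would of course not have this global view, which is precisely what the 2% range is meant to model.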



8 Simulation Results 



We apply each of the algorithms to an initial random placement and run until the
algorithm reaches a stable point. Fig. 1 shows file availability distributions for
the three algorithms. The MinMax algorithm shows the widest variance, with an
almost linear file availability distribution between 4 and 5 nines. The RandRand
algorithm yields a much tighter distribution, and the MinRand distribution is
almost exactly the same as that for RandRand, except for the upper tail.

Fig. 2. Minimum vs. Mean File Availability

To study how the minimum file availability varies, we apply each of the algo- 
rithms to 100 randomly selected subsets of 100 machines from the measured set 
of 51,662 machines. Fig. 2 shows a box plot E2I of the ratio of the minimum file
availability to the mean file availability for the three algorithms. The “waist” in 
each box indicates the median value, the “shoulders” indicate the upper quar- 
tile, and the “hips” indicate the lower quartile. The vertical line from the top 
of the box extends to a horizontal bar indicating the maximum data value less 
than the upper cutoff, which is the upper quartile plus 3/2 the height of the 
box. Similarly, the line from the bottom of the box extends to a bar indicating 
the minimum data value greater than the lower cutoff, which is the lower quar- 
tile minus 3/2 the height of the box. Data outside the cutoffs is represented as 
points. 

The worst-case ratio that our simulation encountered for the MinMax algo- 
rithm is 0.74, which is poor but significantly better than the value of 0 which our 
competitive analysis showed is possible. The worst-case ratio found for MinRand 
is 0.93, and the worst-case ratio found for RandRand is 0.91. These values are 
both better than the theoretic competitive ratio of 2/3 for the two algorithms. 
Note that the simulation ratios use mean file availability as the denominator, 
rather than the optimum minimum file availability, since the latter is not easily 
computable. Therefore, the ratios may be artificially lowered by the possibly in- 
correct assumption that the mean file availability is an achievable value for the 
availability of the minimum file.

9 Related Work

Other than Farsite, serverless distributed file systems include xFS 0 and Frangi- 
pani CHI, both of which provide high availability and reliability through dis- 
tributed RAID semantics, rather than through replication. Archival Intermem- 
ory u and OceanStore m both use erasure codes and widespread data distri- 
bution to avoid data loss. The Eternity Service ^ uses full replication to prevent 
loss even under organized attack, but does not address automated placement of 
data replicas. A number of peer-to-peer file sharing applications have been re- 
leased recently: Napster ini and Gnutella m provide services for finding files, 
but they do not explicitly replicate files nor determine the locations where files 
will be stored. Freenet |E| performs file migration to generate or relocate replicas 
near their points of usage. 

To the best of our knowledge this is the first study of the availability of 
replicated files. We know of negative results of hill-climbing algorithms in other 
areas, such as clustering CH. 

There is a common denominator of our work and the research area of approxi- 
mation algorithms, especially in the domain of online approximation algorithms 
such as scheduling m- In online computing, an algorithm must decide 
how to act on incoming items without knowledge of the future. This is related 
to our work, in the sense that a distributed hill-climbing algorithm also makes 
decisions locally (without knowledge of the whole system), and where an adversary
continuously changes the parameters of the system (e.g. the availabilities of
the machines) in order to damage a good assignment of replicas to machines. 

Open Problems 

There are a variety of questions we did not tackle in this paper. First, we focused 
on giving bounds for the efficacy of the three algorithms rather than efficiency. It 
is an interesting open problem how quickly the hill-climbing algorithms converge; 
both in a transient case (where we fix the availabilities of the machines and start 
with an arbitrary assignment of machines to files), and also in a steady-state 
case (where during an [infinite] execution of the algorithm an adversary with 
limited power can continuously change the availabilities of machines); see |H1 for 
a simulation of the transient case and |H| for a simulation of the steady-state 
case. 

It would also be interesting to drop some of the restrictions in this paper, in 
particular the simplification that each file has unit size. 

Finally, it is an open problem whether there is another decentralized hill- 
climbing algorithm that has better efficacy and efficiency than the algorithms 
presented in this paper. For example, does it help if we considered exchanges 
between any group of three or four files? Or does it help to sometimes consider 
“downhill” exchanges too? In general, we would like to give lower bounds on the 
performance of any incremental and distributable algorithm. We feel that this 
area of research has a lot of challenging open problems, comparable with the 
depth and elegance of the area of online computation.

Abstract. We present an algorithm for propagating updates with information
theoretic security that propagates an update in time logarithmic
in the number of replicas and linear in the number of corrupt replicas. 
We prove a matching lower bound for this problem. 



I cannot tell how the truth may be; I say the tale as ’twas said to 
me. - Sir Walter Scott 

1 Introduction 

In this paper, we consider the problem of secure information dissemination with 
information theoretic guarantees. The system we consider consists of a set of 
replica servers that store copies of some information, e.g., a file. A concern 
of deploying replication over large scale, highly decentralized networks is that 
some threshold of the replicas may become (undetectably) corrupt. Protection 
by means of cryptographic signatures on the data might be voided if the corrup- 
tion is the action of an internal intruder, might be impossible if data is generated 
by low powered devices, e.g., replicated sensors, or might simply be too costly to 
employ. The challenge we tackle in this work is to spread updates to the stored 
information in this system efficiently and with unconditional security, while pre- 
venting corrupted information from contaminating good replicas. Our model is 
relevant for applications that employ a client-server paradigm with replication 
by the servers, for example distributed databases and quorum-systems. 

More specifically, our problem setting is as follows. Our system consists of 
n replica servers, of which strictly fewer than a threshold b may be arbitrarily
corrupt; the rest are good replicas. We require that each pair of good servers is
connected by an authenticated, reliable, non-malleable communication channel. 
In order to be able to distinguish correct updates from corrupted (spurious) 
ones, we postulate that each update is initially input to an initial set of a good 
replicas, where a is at least b, the presumed threshold on the possible number 
of corrupt replicas. In a client-server paradigm, this means that the client’s 
protocol for submitting an update to the servers addresses all the replicas in the 
initial set. The initial set is not known a priori, nor is it known to the replicas
themselves at the outset of the protocol, or even during the protocol. Multiple
updates are being continuously introduced to randomly designated initial sets,
and the diffusion of multiple updates actually occurs simultaneously. This is done
by packing several updates in each message. Because we work with information 
theoretic security, the only criterion by which an update is accepted through 
diffusion by a good replica is when b different replicas independently vouch for 
its veracity. It should be stressed that we do not employ cryptographic primitives 
that are conditioned on any intractability assumptions, and hence, our model is 
the full Byzantine model without signatures. 

The problem of secure information dissemination in a full Byzantine environment
was initiated in [MMR99] and further explored in [MRRS01]. Because of
the need to achieve information theoretic security, the only method to ascertain 
the veracity of updates is by replication. Consequently, those works operated 
with the following underlying principle: A replica is initially active for an up- 
date if it is input to it, and otherwise it is passive. Active replicas participate in 
a diffusion protocol to disseminate updates to passive replicas. A passive replica 
becomes active when it receives an update directly from b different sources, and 
consequently becomes active in its diffusion. For reasons that will become clear 
below, we call all algorithms taking this approach conservative. More formally: 

Definition 1. A diffusion algorithm in which a good replica p sends an update u
to another replica q only if p is sure of the update’s veracity is called conservative.

In contrast, we call non-conservative algorithms liberal. Conservative algorithms
are significantly limited in their performance. To illustrate this, we need
to informally establish some terminology. First, for the purpose of analysis, we
conceive of propagation protocols as progressing in synchronous rounds, though
in practice, the rounds need not occur in synchrony. Further, for simplicity, we
assume that in each round a good replica can send out at most one message
(i.e., the Fan-out, $F^{out}$, is one); a more detailed treatment can relate to $F^{out}$ as
an additional parameter. The two performance measures introduced in [MMR99]
are as follows (precise definitions are given in the body of the paper):

— Let Delay denote the expected number of communication rounds from when
an update is input to the system until it reaches all the replicas;

— Let Fan-in ($F^{in}$) denote the expected maximum number of messages received
by any replica from good replicas in a round (intuitively, $F^{in}$ measures
the “load” on replicas).

In [MMR99] a lower bound is shown on conservative algorithms of Delay *
$F^{in} = \Omega((nb/a)^{1-1/b})$. This linear lower bound is discouraging, especially compared
with the cost of epidemic-style diffusion of updates in a benign-failure
environment¹, which has Delay * $F^{in}$ = O(log n). Such efficient diffusion would
have been possible in a Byzantine setting if signatures were utilized to distinguish
correct from spurious updates, but as already discussed, deploying digital
signatures is ruled out in our setting. It appears that the advantages achieved
by avoiding digital signatures come at a grave price.
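For comparison, the benign-failure baseline mentioned above is easy to simulate. The sketch below is a toy model under our own assumptions: all n replicas are good, one replica starts active, and in every synchronous round each active replica sends the update to one target chosen uniformly at random (fan-out one). The function name `epidemic_rounds` is hypothetical.

```python
import random

def epidemic_rounds(n, seed=0):
    """Rounds until all n replicas are active when each active replica contacts one random replica per round."""
    rng = random.Random(seed)
    active = {0}                      # replica 0 starts with the update
    rounds = 0
    while len(active) < n:
        # One message per active replica; targets are chosen independently.
        targets = {rng.randrange(n) for _ in active}
        active |= targets
        rounds += 1
    return rounds

for n in (64, 1024, 16384):
    print(n, epidemic_rounds(n))
```

Since the active set can at most double per round, at least $\log_2 n$ rounds are needed; the simulated counts stay within a small multiple of that, in line with the logarithmic cost quoted for the benign case.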

¹ By epidemic-style diffusion we refer to a method whereby in each round, each active
replica chooses a target replica independently at random and sends to it the update.

Fortunately, in this paper we propose an approach for diffusion in full
Byzantine settings that is able to circumvent the predictions of [MMR99] using
a fundamentally different approach. Our proposed liberal algorithm has
Delay * $F^{in}$ = O(b + log n) and enjoys the same simplicity of epidemic-style
propagation. The main price paid is in the size of messages used in the protocol.
Although previous analyses ignored the size of messages, we note that our
method requires additional communication space of $n^{O(\log(b + \log n))}$ per message.
In terms of delay, we prove our algorithm optimal by showing a general lower 
bound of 17(6^^^=^ -|- log -) on the delay for the problem model. 

Our liberal approach works as follows. As before, a replica starts the protocol 
as active if it receives an update as input. Other replicas start as passive. Active 
replicas send copies of the update to other replicas at random. When a passive 
replica receives a copy of an update through another replica, it becomes hesi- 
tant for this update. A hesitant replica sends copies of the update, along with 
information about the paths it was received from, to randomly chosen replicas. 
Finally, when a replica receives copies of an update over b vertex-disjoint paths, 
it believes its veracity, and becomes active for it. 

It should first be noted that this method does not allow corrupt updates to 
be accepted by good replicas. Intuitively, this is because when an update reaches 
a good replica, the last corrupt replica it passed through is correctly expressed in 
its path. Therefore, a spurious update cannot reach a good replica over b disjoint 
paths. 
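The acceptance rule can be illustrated with a small brute-force check. The encoding is an assumption of this sketch: each reported path is a tuple of the replica identifiers the update passed through, excluding the receiver itself, and the function name `accepts` is hypothetical. The search over all b-subsets is exponential and only meant for tiny examples, not as the procedure a receiver would actually run.

```python
from itertools import combinations

def accepts(paths, b):
    """True if some b of the reported paths are pairwise vertex-disjoint."""
    for chosen in combinations(paths, b):
        seen = set()
        for p in chosen:
            if seen & set(p):
                break                 # this selection reuses a replica
            seen |= set(p)
        else:
            return True               # no overlap found among the b chosen paths
    return False

# Four reported paths; two of them share replica 5.
paths = [(1,), (2, 5), (3, 5), (4, 6)]
print(accepts(paths, 3), accepts(paths, 4))

# A spurious update injected by a single corrupt replica 9: it shows up on every path.
spurious = [(7, 9), (8, 9), (9,)]
print(accepts(spurious, 2))
```

The second example mirrors the intuition given above: because the last corrupt replica is correctly recorded on every path it touches, paths that originate from one corrupt replica can never be pairwise disjoint.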

It is left to analyze the diffusion time and message complexity incurred by 
the propagation of these paths. Here, care should be taken. Since we show that 
a lower bound of 17 ( 6 ^ 2 ^ _|_ log holds on the delay, then if path-lengthening 
proceeds uncontrolled throughout the algorithm, then messages might carry up 
to 0{b^) paths. For a large 6, this would be intolerable, and also too large to 
search for disjoint paths at the receivers. Another alternative that would be 
tempting is to try to describe the paths more concisely by simply describing the 
graph that they form, having at most O(nb) edges. Here, the problem is that
corrupt replicas can in fact create spurious updates that appear to propagate 
along b vertex-disjoint paths in the graph, despite the fact that there were no 
such paths in the diffusion. 

Our solution is to limit all paths to length log^. That is, a replica that 
receives an update over a path of length log does not continue to further prop- 
agate this path. Nevertheless, we let the propagation process run for 0(6-1- log 
rounds, during which paths shorter than log ^ continue to lengthen. This pro- 
cess generates a dense collection of limited length paths. Intuitively, the diffusion 
process then evolves in two stages. 

1. First, the diffusion of updates from the a active starting points is carried 
as an independent epidemic-style process, so each one of the active replicas 
establishes a group of hesitant replicas to a vicinity of logarithmic diameter. 

2. Each log-diameter vicinity of active replicas now directly targets (i.e., with
paths of length 1) the remaining graph. With careful analysis it is shown
that it takes an additional O(b) rounds for each replica to be targeted directly
by some node from b out of the a disjoint vicinities of active replicas, over b
disjoint paths.

Throughout the protocol, each replica diffuses information about up to 0{{b+ 
log f different paths, which is the space overhead on the communication. 



1.1 Related Work 



Diffusion is a fundamental mechanism for driving replicated data to a consistent 
state in a highly decentralized system. Our work optimizes diffusion protocols in 
systems where arbitrary failures are a concern, and may form a basis of solutions 
for disseminating critical information in this setting. 

The study of Byzantine diffusion was initiated in [MMR99]. That work
established a lower bound for conservative algorithms, and presented a family
of nearly optimal conservative protocols. Our work is similar to the approach
taken in [MMR99] in its use of epidemic-style propagation, and consequently
in its probabilistic guarantees. It also enjoys similar simplicity of deployment,
especially in real-life systems where partially-overlapping universes of replicas
exist for different data objects, and the propagation scheme needs to handle
multiple updates to different objects simultaneously. The protocols of [MMR99]
were further improved, and indeed, the lower bound of [MMR99] circumvented
to some extent, in 



but their general worst case remained the same. 

The fundamental distinction between our work and the above works is in
the liberal approach we take. With the liberal approach, we are able to completely
circumvent the lower bound of [MMR99], albeit at the cost of increased message
size. An additional advantage of liberal methods is that in principle, they can
provide update diffusion in any b-connected graph (though some topologies may
increase the delay of diffusion), whereas the conservative approach might simply 
fail to diffuse updates if the network is not fully connected. The investigation 
of secure information diffusion in various network topologies is not pursued fur- 
ther in this paper however, and is a topic of our ongoing research. The main 
advantage of the conservative approach is that spurious updates generated by 
corrupt replicas cannot cause good replicas to send messages containing them; 
they may however inflict load on the good replicas in storage and in receiving 
and processing these updates. Hence, means for constraining the load induced 
by corrupt replicas must exist in both approaches. 

While working on this paper, we learned that our liberal approach to secure
information diffusion has been independently investigated by Minsky and
Schneider [MS01]. Their diffusion algorithms use age to decide which updates
to keep and which to discard, in contrast to our approach, which discards based
on the length of the path an update has traversed. Also, in the algorithms of
[MS01], replicas pull updates, rather than push messages to other replicas, in
order to limit the ability of corrupt hosts to inject bogus paths into the system.
Simulation experiments are used in [MS01] to gain insight into the performance
of those protocols; a closed-form analysis was sought by Minsky and Schneider
but could not be obtained. Our work provides the foundations needed to analyze
liberal diffusion methods, provides general lower bounds, and proves optimality
of the protocol we present.

Prior to the above works, work on update diffusion focused on systems
that can suffer benign failures only. Notably, Demers et al.
performed a detailed study of epidemic algorithms for the benign setting, in
which each update is initially known at a single replica and must be diffused to
all replicas with minimal traffic overhead. One of the algorithms they studied,
called anti-entropy and apparently initially proposed in , was
adopted in Xerox's Clearinghouse project (see [DGH+87]) and the Ensemble
system [BHO+99]. Similar ideas also underlie IP-Multicast [Dee89] and MUSE
(for USENET News propagation) [LOM94]. The algorithms studied here for
Byzantine environments behave fundamentally differently from any of the above
settings where the system exhibits benign failures only.

Prior studies of update diffusion in distributed systems that can suffer Byzan- 
tine failures have focused on single-source broadcast protocols that provide re- 
liable communication to replicas and replica agreement on the broadcast value 
(e.g., sometimes with additional ordering guaran- 

tees on the delivery of updates from different sources 

(e.g., jHeit)4l( ;A8I E)5IIVI IVID5IK M Mf)8l( Mjhhj b The problem that we consider 
here is different from these works in the following ways. First, in these prior 
works, it is assumed that one replica begins with each update, and that this 
replica may be faulty — in which case the good replicas can agree on an arbitrary 
update. In contrast, in our scenario we assume that at least a threshold a > b of
good replicas begin with each update, and that only these updates (and no arbi- 
trary ones) can be accepted by good replicas. Second, these prior works focus on 
reliability, i.e., guaranteeing that all good replicas (or all good replicas in some 
agreed-upon subset of replicas) receive the update. Our protocols diffuse each 
update to all good replicas only with some probability that is determined by the 
number of rounds for which the update is propagated before it is discarded. Our 
goal is to devise diffusion algorithms that are efficient in the number of rounds 
until the update is expected to be diffused globally and the load imposed on 
each replica as measured by the number of messages it receives in each round. 



2 Preliminaries 



Following the system model of [MMR99], our system consists of a universe S of n
replicas to which updates are input. Strictly fewer than some known threshold b of
the replicas could be corrupt; a corrupt replica can deviate from its specification
arbitrarily (Byzantine failures). Replicas that always satisfy their specifications 
are good. We do not allow the use of digital signatures by replicas, and hence, 
our model is the full information-theoretic Byzantine model. 

Replicas can communicate via a completely connected point-to-point network.
Communication channels between good replicas are reliable and authenticated,
in the sense that a good replica $p_i$ receives a message on the communication
channel from another good replica $p_j$ if and only if $p_j$ sent that message
to $p_i$.

Our work is concerned with the diffusion of updates among the replicas.
Each update u is introduced to an initial set $I_u$ containing at least a > b good
replicas, and is then diffused to other replicas via message passing. Replicas in
$I_u$ are considered active for u. The goal of a diffusion algorithm is to make all good
replicas active for u, where a replica p is active for u only if it can guarantee its
veracity.

Our diffusion protocols proceed in synchronous rounds. For simplicity, we
assume that each update arrives at each replica in $I_u$ simultaneously, i.e., in the
same round at each replica in $I_u$. This assumption is made purely for simplicity
and does not affect either the correctness or the speed of our protocol. In
each round, each replica selects one other replica to which it sends information
about updates as prescribed by the diffusion protocol. That is, the fan-out,
$F^{out}$, is assumed to be 1.¹ A replica receives and processes all the messages sent to it
in a round before the next round starts.

We consider the following three measures of quality for diffusion protocols: 

Delay: For each update, the delay is the worst-case expected number of rounds
from the time the update is introduced to the system until all good replicas
are active for the update. Formally, let $\eta_u$ be the round number in which update
u is introduced to the system, and let $\tau_p^u$ be the round in which a good
replica p becomes active for update u. The delay is $E[\max_{p,C}\{\tau_p^u\} - \eta_u]$,
where the expectation is over the random choices of the algorithm and the
maximization is over good replicas p, all failure configurations C containing
fewer than b failures, and all behaviors of those corrupt replicas. In particular,
$\max_{p,C}\{\tau_p^u\}$ is reached when the corrupt replicas send no updates, and our
delay analysis applies to this case.

Fan-in: The fan-in measure, denoted by $F^{in}$, is the expected maximum number
of messages that any good replica receives in a single round from good
replicas under all possible failure scenarios. Formally, let $\rho_p^i$ be the number of
messages received in round i by replica p from good replicas. Then the fan-in
in round i is $E[\max_{p,C}\{\rho_p^i\}]$, where the maximum is taken with respect to
all good replicas p and all failure configurations C containing fewer than b
failures. Amortized fan-in is the expected maximum number of messages received
over multiple rounds, normalized by the number of rounds. Formally,
a k-amortized fan-in starting at round l is $E[\max_{p,C}\{\sum_{i=l}^{l+k-1} \rho_p^i / k\}]$. We
emphasize that fan-in and amortized fan-in are measures only for messages from
good replicas.

Communication complexity: The maximum amount of information pertain- 
ing to a specific update, that was sent by a good replica in a single message. 
The maximum is taken on all the messages sent (in any round), and with 
respect to all good replicas and all failure configurations C containing fewer 
than b failures. 
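
As an illustration of the fan-in measure, the following minimal sketch estimates the expected maximum number of messages received by any replica in a single round. It assumes that all n replicas are good, that every replica sends in the round, and that each sender picks its target uniformly at random among the other replicas; the function names are hypothetical and not part of the protocol.

```python
import random

def max_fan_in(n, rng):
    """One round with fan-out 1: each of n replicas sends to one other replica
    chosen uniformly at random. Returns the largest number of messages that
    any single replica receives in that round."""
    received = [0] * n
    for sender in range(n):
        target = rng.randrange(n - 1)
        if target >= sender:          # shift so a replica never targets itself
            target += 1
        received[target] += 1
    return max(received)

def estimate_fan_in(n, trials, seed=0):
    """Monte Carlo estimate of the expected single-round maximum fan-in."""
    rng = random.Random(seed)
    return sum(max_fan_in(n, rng) for _ in range(trials)) / trials
```

Averaging such per-round maxima over k consecutive rounds gives a rough feel for why an amortized measure can be much smaller than the single-round one.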

¹ We could expand the treatment here to relate to $F^{out}$ as a parameter, but chose not
to do so for simplicity.
Note that what interests us is the expected value of the measures. When we
make statements of the type "within an expected f(r) rounds, P(r)" (for some
predicate P, and function f), we mean that if we define X as a random variable
that measures the time until P(r) is true then E(X) = f(r).

The following bound presents an inherent tradeoff between delay and fan-in
for conservative diffusion methods (Definition 1), when the initial set $I_u$ is
arbitrarily designated:

Theorem 1 ([MMR99]). Let there be a conservative diffusion algorithm. Denote
by D the algorithm's delay, and by $F^{in}$ its D-amortized fan-in. Then
$D F^{in} = \Omega(bn/a)$, for $b > 2 \log n$.

One contribution of the present work is to show that the lower bound of
Theorem 1 for conservative diffusion algorithms does not hold once inactive
replicas are allowed to participate in the diffusion.

3 Lower Bounds 

In this section we present lower bounds which apply to any diffusion method in 
our setting. Our main theorem sets a lower bound on the delay. It states that 
the propagation time is related linearly to the number of corrupt replicas and 
logarithmically to the total number of replicas. 

We start by showing the relation between the delay and the number of corrupt 
players. 

Lemma 1. Let there be any diffusion algorithm in our setting. Let D denote
the algorithm's delay. Then $D = \Omega(b(n-a)/n)$.

Proof. Since it is possible that there are b - 1 corrupt replicas, each good replica
that did not receive the update initially as input must be targeted directly by
at least b different other replicas, as otherwise corrupt replicas can cause it
to accept an invalid update. Since only a replicas receive the update initially,
at least b(n - a) direct messages must be sent. As $F^{out} = 1$ and there are n
replicas, at most n messages are sent in each round. Therefore it takes at least
$b(n-a)/n$ rounds to have b(n - a) direct messages sent.

We now show the relationship of the delay to the number of replicas. 

Lemma 2. Let there be any diffusion algorithm in our setting. Let D denote
the algorithm's delay. Then $D = \Omega(\log(n/a))$.

Proof. Each replica has to receive a copy of the update. Since $F^{out} = 1$, the
number of replicas that have received the update up to round t is at most twice the
number of replicas that had received the update up to round t - 1. Therefore at the
final round $t_{end}$, when all replicas have received the update, we have that $a \cdot 2^{t_{end}} \ge n$,
or $t_{end} \ge \log(n/a)$.
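
The doubling argument in this proof can be sketched as a few lines of code. The helper below is hypothetical and models only the best case, in which every informed replica informs one new replica per round:

```python
def rounds_to_reach_all(n, a):
    """Smallest t with a * 2**t >= n: the fewest rounds in which n replicas can
    all hold the update when a replicas start with it and the number of
    informed replicas can at most double per round (fan-out 1)."""
    if not 1 <= a <= n:
        raise ValueError("need 1 <= a <= n")
    informed, t = a, 0
    while informed < n:
        informed *= 2   # every informed replica reaches one new replica
        t += 1
    return t
```

For example, with n = 1000 and a = 8 the loop doubles 8 seven times before reaching 1024, matching the ceiling of log2(1000/8).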
The following theorem immediately follows from the previous two lemmas:

Theorem 2. Let there be any diffusion algorithm in our setting. Let D denote
the algorithm's delay. Then $D = \Omega(b(n-a)/n + \log(n/a))$.
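
To get a feel for the two terms of this bound, the sketch below evaluates the expression inside the Omega, ignoring hidden constants. It takes the counting term b(n - a)/n from the proof of Lemma 1 and the doubling term log2(n/a) from the proof of Lemma 2; the function name is hypothetical.

```python
import math

def delay_lower_bound(n, a, b):
    """Evaluate b*(n - a)/n + log2(n/a), the expression inside the Omega of
    the delay lower bound (hidden constants are ignored)."""
    if not (0 < b < a <= n):
        raise ValueError("need 0 < b < a <= n")
    counting_term = b * (n - a) / n     # b(n - a) direct messages, n per round
    doubling_term = math.log2(n / a)    # informed set at most doubles per round
    return counting_term + doubling_term
```

With n = 1024, a = 512 and b = 10 the two terms are 5 and 1, so the counting term dominates once b is large relative to log(n/a).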

Remark 1. We will deal primarily with the case where $a \le n/2$, as otherwise the
diffusion problem is relatively simple. In particular, if $a > n/2$ then we can use
the algorithm of [MMR99] to yield a delay of $O(b)$, which is optimal for $F^{out} = 1$.
When $a \le n/2$ our lower bound is equal to $\Omega(b + \log(n/a))$, which is met by the
propagation algorithm presented below.



Remark 2. We note that in order for an update to propagate successfully we
must have that a > b. From this, it immediately follows that $b < n/2$. However,
below we shall have a tighter constraint on b that stems from our diffusion
method. We note that throughout this paper no attempt is made to optimize
constants.

4 The Propagation Algorithm 

In this section we present an optimal propagation algorithm that matches the
lower bound shown in Section 3.

In our protocol, each replica can be in one of three states for a particular 
update: passive, hesitant or active. Each replica starts off either in the active 
state, if it receives the update initially as input, or (otherwise) in the passive 
state. In each round, the actions performed by a replica are determined by its 
state. The algorithm performed in a round concerning a particular update is as 
follows: 



— An active replica chooses a random replica and sends the update to it. (Com- 
pared with the actions of hesitant replicas below, the lack of any paths at- 
tached to the update conveys the replica’s belief in the update’s veracity.) 

— A passive or hesitant replica p that receives the update from q, with various 
(possibly empty) path descriptions attached, appends q to the end of each 
path and saves the paths. If p was passive, it becomes hesitant. 

— A hesitant replica chooses a random replica and sends to it all vertex-minimal
paths of length $< \log(n/b)$ over which the update was received.

— A hesitant replica that has b vertex-disjoint paths for the update becomes
active.
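
The four rules above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the authors' implementation: all replicas are good, a single update is tracked, a path is a tuple of replica ids, "vertex-minimal" is approximated by keeping only paths that do not revisit a vertex, and the disjointness test is an exhaustive search that is only practical for small inputs. All names (`run_round`, `diffuse`, `max_len`, and so on) are hypothetical.

```python
import random

def has_disjoint_paths(paths, b):
    """True if `paths` contains b pairwise vertex-disjoint paths (backtracking)."""
    paths = sorted(set(paths), key=len)

    def search(start, used, needed):
        if needed == 0:
            return True
        for i in range(start, len(paths)):
            if used.isdisjoint(paths[i]) and search(i + 1, used | set(paths[i]), needed - 1):
                return True
        return False

    return search(0, frozenset(), b)

def run_round(states, paths, b, max_len, rng):
    """One synchronous round for a single update; mutates states and paths."""
    n = len(states)
    inbox = []                                   # (receiver, sender, attached paths)
    for q in range(n):
        if states[q] == "passive":
            continue
        target = rng.choice([r for r in range(n) if r != q])
        if states[q] == "active":
            attached = [()]                      # no path attached: sender vouches itself
        else:                                    # hesitant: forward only short paths
            attached = [p for p in paths[q] if len(p) < max_len]
        inbox.append((target, q, attached))
    for p, q, attached in inbox:
        if states[p] == "active":
            continue
        for path in attached:
            if p not in path and q not in path:  # keep simple paths only
                paths[p].add(path + (q,))        # receiver appends the sender
        if states[p] == "passive" and paths[p]:
            states[p] = "hesitant"
    for p in range(n):
        if states[p] == "hesitant" and has_disjoint_paths(paths[p], b):
            states[p] = "active"

def diffuse(n, a, b, max_len, max_rounds, seed=0):
    """Run rounds until every replica is active; return the round count or None."""
    rng = random.Random(seed)
    states = ["active"] * a + ["passive"] * (n - a)
    paths = [set() for _ in range(n)]
    for r in range(1, max_rounds + 1):
        run_round(states, paths, b, max_len, rng)
        if all(s == "active" for s in states):
            return r
    return None
```

A call such as `diffuse(12, 4, 2, 2, 500)` runs the sketch with 12 replicas, 4 of them initially active, b = 2 and a path limit of 2. Because a path received directly from an active replica has length 1, b direct messages from distinct active replicas are enough to activate the receiver.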



A couple of things are worth noting here. First, it should be clear that the al- 
gorithm above executes simultaneously for all concurrently propagating updates. 
Second, any particular update is propagated by replicas for a limited number of 
rounds. The purpose of the analysis in the rest of the paper is to determine the 
number of rounds needed for the full propagation of an update. Finally, some
optimizations are possible. For example, a hesitant replica p that has b vertex
disjoint paths passing through a single vertex q (i.e., disjoint between q and p)
can unify the paths to be equivalent to a direct communication from the vertex q.

We now prove that our algorithm is correct. 

Lemma 3. If a good replica becomes active for an update then the update was 
initially input to a good replica. 

Proof. There are two possible ways in which a good replica can become active for 
an update. The first possibility is when the replica receives the update initially 
as input. In this case the claim certainly holds. 

The second possibility is when the replica receives the update over b vertex 
disjoint paths. We say that a corrupt replica controls a path if it is the last 
corrupt replica in the path. Note that for any invalid update which was generated 
by corrupt replica(s), there is exactly one corrupt replica controlling any path 
(since by definition the update was created by the corrupt replicas) . Since good 
replicas follow the protocol and do not change the path(s) they received, the 
corrupt controlling replica will not be removed from any path by any subsequent 
good replica receiving the update. As there are less than b corrupt replicas and 
the paths are vertex disjoint there are less than b such paths. As a good replica 
becomes active for an update when it receives the update over b disjoint paths, 
at least one of the paths has only good replicas in it. Therefore the update was 
input to a good replica. 
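
The counting step at the heart of this proof (fewer than b corrupt replicas cannot touch all of b vertex-disjoint paths) can be checked by brute force on small instances. The helpers below are hypothetical and only illustrate the pigeonhole argument:

```python
from itertools import combinations

def all_good_paths(paths, corrupt):
    """Return the paths that contain no corrupt replica."""
    corrupt = set(corrupt)
    return [p for p in paths if corrupt.isdisjoint(p)]

def survives_every_corruption(paths, replicas, b):
    """Check that for every set of fewer than b corrupt replicas, at least one
    of the b given paths consists of good replicas only."""
    assert len(paths) == b
    for size in range(b):
        for corrupt in combinations(replicas, size):
            if not all_good_paths(paths, corrupt):
                return False
    return True
```

The check succeeds for vertex-disjoint paths and fails as soon as two paths share a vertex, since a single corrupt replica on the shared vertex then controls both.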

The rest of this paper will prove the converse direction. If an update was
initially input to a > b good replicas then within a relatively small number of
rounds, all good replicas will receive the update with high probability.

5 Performance Analysis 

In this section, we proceed to analyze the performance of our algorithm. Our 
treatment is based on a communication graph that gradually evolves in the 
execution. We introduce some notation to be used in the analysis below. At 
every round r, the communication graph $G_r = (V, E_r)$ is defined on (good)
vertices V such that there is a (directed) edge between two vertices if one sent
any message to the other during round r. We denote by $N_G(I)$ the neighborhood
of I (singleton or set) in G. We denote by $\|p, q\|_G$ the shortest distance between
p and q in G. In the analysis below, we use vertices and replicas interchangeably.

Our proof will make use of gossip-circles that gradually evolve around active
replicas. Intuitively, the gossip-circle $C(p, d, r)$ of a good active replica p is the set
of good replicas that heard the update from p over good paths (comprising good
replicas) of length up to d in r rounds. Formally:
Definition 2. Let p be some good replica which is active for the update u. Let
$\{G_j = (V, E_j)\}_{j=1..r}$ be the set of communication graphs of r rounds of the
execution of vertices in V. Recall that $N_G(I)$ denotes the set of all neighbors in
a graph G of nodes in I. We then define gossip circles of p in r rounds inductively
as follows:

$C_V(p, 0, r) = \{p\}$

$\forall\, 1 \le d \le r:$

$C_V(p, d, r) = C_V(p, d-1, r) \cup \{ q \in N_{G_d}(C_V(p, d-1, r)) : \|p, q\|_{C_V(p,d-1,r)} \le \min\{d-1, \log(n/b) - 1\} \}$

When V is the set of good replicas, we omit it for simplicity. Note that the gossip
circle $C(p, d, r)$ is constrained by definition to have radius $\le \min\{d, \log(n/b)\}$.

The idea behind our analysis is that any b initial active good replicas spread 
paths that cover disjoint low-diameter gossip-circles of size Hence, it is suf- 
ficient for any replica to be directly targeted by some replica from each one of 
these sets in order to have b vertex-disjoint paths from initial replicas. 

We first show a lemma about the spreading of epidemic-style propagation
with bounded path length. Without bounding paths, the analysis reduces to
epidemic-style propagation for benign environments, as shown in [DGH+87].



Lemma 4. Let $p \in I_u$ be a good replica, and let $d \le \log(n/b)$. Assume there are no
corrupt replicas. Then within an expected $r > d$ rounds,
$|C(p, d, r)| \ge \min\{(3/2)^d + (r-d)(3/2)^{d-4},\ n/2\}$.



Proof. The proof looks at an execution of r rounds of propagation in two parts.
The first part consists of d rounds. In this part, the set of replicas that received
a copy of u (equivalently, received a copy of u over paths of length $\le d$) grows
exponentially. That is, in d rounds, the update propagates to $(3/2)^d$ replicas. The
second part consists of the remaining r - d rounds. This part makes use of the
fact that at the end of the first part, an expected $\frac{1}{2}(3/2)^d$ replicas have received a copy of
u over paths of length $< d$. Hence, in the second part, a total of at least
$(r-d) \times \frac{1}{4}(3/2)^d$ replicas receive u.

Formally, let $m_j$ denote the number of replicas that received u from p over
paths of length $\le d$ by round j, i.e., $m_j = |C(p, d, j)|$.

Let $j < d$. So long as the number of replicas reached by paths of length $\le d$
does not already exceed $n/2$, in round j + 1 each replica in $C(p, d, j)$ targets
a new replica with probability $> 1/2$. Therefore, the expected number of messages
sent until $m_j/2$ new replicas are targeted is at most $m_j$. Furthermore, since at least
$m_j$ messages are sent in round j, this occurs within an expected one round. We
therefore have that the expected time until $(3/2)^d$ replicas receive u over paths of
length $\le d$ is at most d.

From round d + 1 on, we note that at least half of $m_d$ received u over paths
of length strictly less than d. Therefore, in each round $j > d$, there are at least
$\frac{1}{2} \times (3/2)^d$ replicas forwarding u over paths of length $\le d$. So long as $m_j < n/2$,
in round j each of these replicas targets a new replica with probability $> 1/2$.
Therefore, the expected number of messages sent until $\frac{1}{2} \times \frac{1}{2} \times (3/2)^d$
new replicas are targeted is at most $\frac{1}{2} \times (3/2)^d$, which occurs in an expected one
round.

Putting the above together, we have that within an expected r rounds,
$(3/2)^d + (r-d) \times \frac{1}{4}(3/2)^d \ge (3/2)^d + (r-d)(3/2)^{d-4}$ replicas are in $C(p, d, r)$,
where the inequality holds because $1/4 \ge (2/3)^4$.

Since the choice of communication edges in the communication graph is made 
at random, we get as an immediate corollary: 

Corollary 1. Let V' Q V be a set of vertiees, eontaining all eorrupt ones, 
ehosen independently from the choices of the algorithm, such that \V'\ < Let 
p G lu be a good replica, and let d < log^. Then within an expected 3r > d 
rounds, \C{y\V')ip,d,‘ir)\ > min{(|)‘' + (r - |}. 

We now use Corollary 1 to build b disjoint gossip circles of initial replicas, and
wish to proceed with the analysis of the number of rounds it takes for replicas 
to be targeted by these disjoint sets. As edges in the communication graph are 
built at random, a tempting approach would be to treat this as a simple coupon 
collector problem on the b gossip-circles where each replica wishes to “collect a 
member” of each of these sets by being targeted with an edge from it. With this 
simplistic analysis, it would take each replica $O(b \log b)$ rounds to collect all the
coupons, and an additional logarithmic factor in n for all replicas to complete.
The resulting analysis would provide an upper bound of $O(b (\log b)(\log n))$ on
the delay. Although this is sufficient for small b, for large b we wish to further
tighten the analysis on the number of rounds needed for diffusion.
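
The gap between collecting all coupons and collecting only a constant fraction of them can be seen in the classical coupon collector model. The sketch below assumes the textbook setting, where each round yields one coupon chosen uniformly at random out of m; this is a simplification of the process analyzed here, in which the per-round success probabilities depend on the sizes of the gossip circles. The function name is hypothetical.

```python
from fractions import Fraction

def expected_draws(m, k):
    """Expected number of uniform draws needed to see k distinct coupons out
    of m: the sum over i = 0..k-1 of m / (m - i)."""
    if not 0 <= k <= m:
        raise ValueError("need 0 <= k <= m")
    return sum(Fraction(m, m - i) for i in range(k))
```

Collecting all m coupons costs m times the m-th harmonic number, which grows like m log m, whereas collecting only half of them costs at most m draws, because every term in the sum is then below 2. This is the reason a stage-by-stage analysis that collects half of the remaining sets at a time can avoid the logarithmic factor.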

The approach we take is to gradually adapt the size of the disjoint gossip- 
circles as the process evolves, and to show that the expected amount of time until 
all sets are connected to a replica remains constant. More precisely, we show that 
in an expected 0{b) rounds, a replica has edges to half of b gossip-circles of size 

We then look at the communication graph with all of the vertices in the 
paths of the previous step(s) removed. We show that in time 0(6/2), a replica 
has edges to gossip-circles of size ^ of half of the | remaining initial replicas. 
And so on. In general, we have an inductive analysis for k = 0..log b. For each
k, we denote $b_k = b/2^k$. For step k of the analysis, we show that in time $O(b_k)$, a
replica has disjoint paths of length $< \log(n/b)$ to $b_k/2$ of the initial replicas. Hence,
in total time $O(b)$, a replica connects to b initial replicas over disjoint paths, all
of length $< \log(n/b)$ (and hence, not exceeding the algorithm's path limit).

Our use of Corollary [D is as follows. Let 6fc = ^, and let V denote a set 
of vertices we wish to exclude from the graph, where \V'\ < Then we have 
that within an expected 3r = 3(6 -I- 2 log rounds, each initial good replica 
has a gossip circle of diameter d = maxjl, 21og whose size is at least 

(6+d-d)(§)(-’*-4) > 

We now use this fact to designate disjoint low-diameter gossip circles around 
6 good replicas in 
Lemma 5. Let $I \subseteq I_u$ be a subset of initial good replicas of size $b_k$. Let W' be a
subset of replieas with \ W'\ < Denote by d = max{l,21og Then within 

an expeeted 3r = 3(6+ d) rounds there exist disjoint subsets {Ci}i^i eontaining 
no vertices of W' , such that each Ci C C(y\wi-^{i,d,'Hr) , and such that each 



Proof The proof builds these sets for I inductively. Suppose that Ci, Ci_i, 
for 0 < i < bk, have been designated already, such that for all 1 < j < z — 1 , 
we have that Cj C C{j,d,3r) and \Cj\ = Denote by C = Uj=i 

Then the total number of vertices in y' = C U W is at most + (* — 1) < 

^ + 6 fc^ — f- From Corollary ^ we get that within an expected 3r rounds, 
and without using any vertex in V' , the gossip circle Cy\y>(i, d, 3r) contains at 

least (6 + 2 log -2 log 5 ^) (|) ^ 



be a subset of C{i,d,3r) of size and the lemma follows. 



We now analyze the delay until a vertex has direct edges to these bk disjoint 
sets. 



Lemma 6 . Let v £ V be a good replica. Let bk = ^ as before and let 
be disjoint sets, each of size and diameter 21 og (as determined by 
Lemma Ell. Then within an expected 46^ rounds there are edges from ^ of the 
sets to V. 



Proof. The proof is simply a coupon collector analysis of collecting ^ out of bk 
coupons, where in epoch i, for 1 < z < ^, the probability of collecting the z’th 
new coupon in a round is precisely the probability of v being targeted by a new 

set, i.e., -. The expected number of rounds until completion is therefore 

Ei=i..(hfc/2) ^ ^ 46fe. 

We are now ready to put these facts together to analyze the delay that a
single vertex incurs for having disjoint paths to b initial replicas.



Lemma 7. Let v £ V be a good replica. Suppose that b < ^. Then within an 
expected 5(6 + log j) rounds there are b vertex disjoint paths of length < log 
from lu to V. 

Proof. We prove by induction on 6 fc = ^, for k = O..(log 6 — 1). To begin the 
induction, we set 6 q = 6 . By Corollary^ within an expected 6+2 log stages, 
there are bo — b disjoint sets (of radius 2 log 5 ^) whose size is By LemmaEl 
within 46o rounds, v has direct edges to ^ of these sets. Hence, it has disjoint 
paths of length < 2 log + 1 to ^ initial replicas. These paths comprise at 
most ^(2 log + 1) good vertices. 

For step 0 < k < (log 6 ) of the analysis, we set 6 ^ = ^. The set of vertices 
used in paths so far, together with all the corrupt vertices, total less than 




6 X bk' 



+ 1 ) < 6 +^ 



k'<k 
By our assumption that b < we get that the total number of vertices used
until step k is less than ^ . Hence, in each step 0 < k < log b, we apply Corollary ^ 
to form bk disjoint sets (of radius 2 log whose size is ^ each. By LemmaEl 

half of these sets have direct edges to v within an expected 46^ rounds. 

In total, we showed that in expected maxo<fc<iogh{4&fc+&+2 log rounds, 
V has disjoint paths (of length at most log to b initial replicas. 

We now wish to bound the time when all of the nodes have b vertex-disjoint
paths to $I_u$. A tempting approach would be to use a Chernoff bound, but the
analysis would then require an additional logarithmic factor in n. This factor
can be avoided by utilizing the fact that after $O(\log n + b)$ rounds a
fraction of the replicas are active for the update. Finally, propagation from
a linear set is easily done.

Lemma 8. Let c > 1 be a constant. The expected time until $(n-b)(1 - 1/c)$
replicas become active is $O(b + \log n)$.

Proof. By Lemma 7 the expected time for a replica to become active is
$5(b + \log(n/b))$. Hence, by Markov's inequality, the probability that a replica becomes
active in $c \times 5(b + \log(n/b))$ rounds or more is at most $1/c$. Hence, within an expected
$c \times 5(b + \log(n/b))$ rounds the number of active replicas is at least $(n-b)(1 - 1/c)$.

We now choose a particular value for c in the previous lemma. We note that 
we choose an arbitrary value without attempting to minimize the constants. 

For c = 2, within an expected 10(6 + log rounds there are ^{n — b) replicas 
who are active for the update. By reusing the supposition 6 < ^ from Lemma|71 
we get that — b) > ^{n — > |n. This means that there are at least |n 

good replicas who are active for the update. 

Lemma 9. If at least |n good replicas are active for the update then within an 
expected 0(6 + logn) rounds all of the replicas become active for the update. 

Proof. Fix any replica and let Yi be the number of updates from active replicas 
that the replica receives in round i. Let + be the number of updates that the 
replica receives in r rounds, i.e., Y = X^i=i ^i- linearity of expectation, 

E(Y) = X)i=i ^O^i) ^ Using a Chernoff bound we have that Pr[Y < ^] < 
e“i5. Therefore if r = 48 logn + 26 we have that Pr[Y < 

Theorem 3. The algorithm terminates in an expected $O(\log n + b)$ rounds.

Proof. By corollary|3 and lemma|H|it follows that within 0(logn + 6) rounds 0.8 
of the replicas become active. From Lemma M within an additional 0(log n + 6) 
rounds all of the replicas become active. 

Therefore, our delay matches the lower bound of Theorem 2.

We conclude the analysis with a log n-amortized $F^{in}$ analysis and a communication
complexity bound. The log n-amortized $F^{in}$ of our algorithm, as shown
in [MMR99], is 1.
In order to finish the analysis the communication complexity (which also
bounds the required storage size) must be addressed. Each vertex $v \in V$ receives
at most $O(b + \log(n/b))$ sets of paths. Paths are of length at most $\log(n/b)$. Therefore,
the communication overhead per message can be bounded by
$O(b + \log(n/b))^{\log(n/b)} = n^{O(\log(b + \log n))}$.
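
The step from the per-message bound $(b + \log(n/b))^{\log(n/b)}$ to $n^{O(\log(b + \log n))}$ can be checked by rewriting the power with base-2 logarithms. The following is a short sketch of that step, assuming $b \ge 1$ (so that $\log(n/b) \le \log n$ and the inner logarithm is non-negative):

```latex
\[
\Bigl(b + \log\tfrac{n}{b}\Bigr)^{\log\frac{n}{b}}
  = 2^{\,\log\frac{n}{b}\,\cdot\,\log\left(b + \log\frac{n}{b}\right)}
  \le 2^{\,\log n\,\cdot\,\log(b + \log n)}
  = n^{\log(b + \log n)}
  = n^{O(\log(b + \log n))}.
\]
```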

This communication complexity can be enforced by good replicas even in the
presence of faulty replicas. A good replica can simply verify that (a) the length
of all paths in any incoming message does not exceed $\log(n/b)$ and that (b) the
out-degree of any vertex does not exceed $O(b + \log(n/b))$. Any violation of (a) or
(b) indicates that the message was sent by a faulty replica, and the message can be safely
discarded.
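
Checks (a) and (b) can be sketched as a filter on incoming messages. The code below is an illustration under assumptions that are not fixed by the text: a path is a tuple of replica ids whose length is its number of vertices, logarithms are to base 2, and the concrete out-degree limit `max_out_degree` (standing in for the $O(b + \log(n/b))$ bound, whose constant is not specified) is supplied by the caller. The function name is hypothetical.

```python
import math
from collections import defaultdict

def is_plausible_message(paths, n, b, max_out_degree):
    """Return False if the paths attached to a message violate check (a) or (b)."""
    max_len = math.log2(n / b)
    successors = defaultdict(set)
    for path in paths:
        if len(path) > max_len:              # check (a): path too long
            return False
        for u, v in zip(path, path[1:]):     # record the edges the paths form
            successors[u].add(v)
    # check (b): no vertex forwards to more distinct successors than allowed
    return all(len(s) <= max_out_degree for s in successors.values())
```

A receiver would drop any message for which the function returns False before storing or forwarding its paths.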

6 Conclusions and Future Work 

This paper presented a round-efficient algorithm for disseminating updates in
a Byzantine environment. The protocol presented propagates updates within
an expected $O(b + \log n)$ rounds, which is shown to be optimal. Compared with
previous methods, the efficiency here was gained at the cost of an increase in
the size of messages sent in the protocol. Our main direction for future work is
to reduce the communication complexity, which was only cursorily addressed in the
present work.
Abstract. We generalize the notion of slice introduced in our earlier
paper 0. A slice of a distributed computation with respect to a global 
predicate is the smallest computation that contains all consistent cuts of 
the original computation that satisfy the predicate. We prove that slice 
exists for all global predicates. We also establish that it is, in general, 
NP-complete to compute the slice. An optimal algorithm to compute 
slices for special cases of predicates is provided. Further, we present an 
efficient algorithm to graft two slices, that is, given two slices, either com- 
pute the smallest slice that contains all consistent cuts that are common 
to both slices or compute the smallest slice that contains all consistent 
cuts that belong to at least one of the slices. We give application of slic- 
ing in general and grafting in particular to global property evaluation 
of distributed programs. Finally, we show that the results pertaining to 
consistent global checkpoints rrmsi can be derived as special cases of 
computation slicing. 



1 Introduction 

Writing distributed programs is an error prone activity; it is hard to reason about 
them because they suffer from the combinatorial explosion problem. Testing and
debugging, and software fault-tolerance are important ways to ensure the
reliability of distributed systems. Thus it becomes necessary to develop techniques
that facilitate the analysis of distributed computations. Various abstractions 
such as predicate detection (e.g., PEQ) and predicate control nmn have 
been defined to carry out such analysis. 

In our earlier paper we propose another abstraction, called computation 
slice, which was defined as: a slice of a distributed computation with respect to a 
global predicate is another computation that contains those and only those con- 
sistent cuts (or snapshots) of the original computation that satisfy the predicate. 
In |5|, we also introduce a class of global predicates called regular predicates: a 
global predicate is regular iff whenever two consistent cuts satisfy the predicate 

* supported in part by the NSF Grants ECS-9907213, CCR-9988225, Texas Education 
Board Grant ARP-320, an Engineering Foundation Fellowship, and an IBM grant. 

J. Welch (Ed.): DISC 2001, LNCS 2180, pp. 7S-PI^ 2001. 
Fig. 1. (a) A computation and (b) its slice with respect to $(x_1 \geq 1) \wedge (x_3 \leq 3)$.



then the cuts given by their set intersection and set union also satisfy the
predicate. There, we show that a slice exists only for regular predicates and present
an efficient algorithm to compute the slice. The class of regular predicates is
closed under conjunction.

A limitation of the definition of slice in our earlier paper is that a slice exists
only for a specific class of predicates. This prompted us to weaken the definition
of slice to the smallest computation that contains all consistent cuts of the
original computation that satisfy the predicate. In this paper, we show that a
slice exists for all global predicates.

The notion of computation slice is analogous to the concept of program slice.
Given a program and a set of variables, a program slice consists of all
statements in the program that may affect the value of the variables in the
set at some given point. A slice could be static or dynamic (for a specific
program input). The notion of a slice has also been extended to distributed
programs. Program slicing has been shown to be useful in program debugging,
testing, program understanding and software maintenance. A slice can
significantly reduce the size of the program to be analyzed, thereby making the
understanding of the program behaviour easier. We expect to reap the same
benefit from a computation slice.

Computation slicing is also useful for reducing the search space for NP-complete
problems such as predicate detection. Given a distributed computation
and a global predicate, predicate detection requires finding a consistent cut
of the computation, if it exists, that satisfies the predicate. It is a fundamental
problem in distributed systems and arises in contexts such as software fault
tolerance, and testing and debugging.

As an illustration, suppose we want to detect the predicate
$(x_1 * x_2 + x_3 < 5) \wedge (x_1 \geq 1) \wedge (x_3 \leq 3)$ in the computation
shown in Fig. 1(a). The computation consists of three processes $P_1$, $P_2$ and
$P_3$ hosting integer variables $x_1$, $x_2$ and $x_3$, respectively. The events are
represented by solid circles. Each event is labeled with the value of the
respective variable immediately after the event is executed. For example, the
value of variable $x_1$ immediately after executing the event $c$ is $-1$. The
first event on each process initializes the state of the process and every
consistent cut contains these initial events. Without computation slicing,
we are forced to examine all consistent cuts of the computation, twenty-eight in
total, to ascertain whether some consistent cut satisfies the predicate.
Alternatively, we can compute a slice of the computation with respect to the
predicate $(x_1 \geq 1) \wedge (x_3 \leq 3)$ as portrayed in Fig. 1(b). The slice is
modeled by a directed graph. Each vertex of the graph corresponds to a subset of
events. If a vertex is contained in a consistent cut, the interpretation is that
all events corresponding to the vertex are contained in the cut. Moreover, a
vertex belongs to a consistent cut only if all its incoming neighbours are also
present in the cut. We can now restrict our search to the consistent cuts of the
slice, which are only six in number, namely $\{a, e, f, u, v\}$, $\{a, e, f, u, v, b\}$,
$\{a, e, f, u, v, w\}$, $\{a, e, f, u, v, b, w\}$, $\{a, e, f, u, v, w, g\}$ and
$\{a, e, f, u, v, b, w, g\}$. The slice has far fewer consistent cuts than the
computation itself (exponentially fewer in many cases), resulting in substantial
savings.

We also show that the results pertaining to consistent global checkpoints
can be derived as special cases of computation slicing. In particular,
we furnish an alternate characterization of the condition under which individual
local checkpoints can be combined with others to form a consistent global
checkpoint (consistency theorem by Netzer and Xu): a set of local checkpoints
can belong to the same consistent global snapshot iff the local checkpoints in
the set are mutually consistent (including with itself) in the slice. Moreover, the
R-graph (rollback-dependency graph) defined by Wang is a special case of
the slice. The minimum and maximum consistent global checkpoints that contain
a set of local checkpoints can also be easily obtained using the slice.

In summary, this paper makes the following contributions: 

— In Section 3, we generalize the notion of computation slice introduced in
our earlier paper. We show that a slice exists for all global predicates in
Section 4.2.

— We establish that it is, in general, NP-complete to determine whether a
global predicate has a non-empty slice in Section 4.1.

— In Section 4.1, an application of computation slicing to monitoring global
properties in distributed systems is provided. Specifically, we give an algorithm to
determine whether a global predicate satisfying certain properties is possibly
true, invariant or controllable in a distributed computation using slicing.

— We present an efficient representation of a slice in Section 5 that we use later to
devise an efficient algorithm to graft two slices in Section 6. Grafting can be
done in two ways. Given two slices, we can either compute the smallest slice
that contains all consistent cuts that are common to both slices or compute
the smallest slice that contains all consistent cuts that belong to at least
one of the slices. An efficient algorithm using grafting to compute the slice for
the complement of a regular predicate, called a co-regular predicate, is provided.
We also show how grafting can be used to avoid examining many consistent
cuts when detecting a predicate.

— We provide an optimal algorithm to compute slices for special cases of regular
predicates in Section 7. In our earlier paper, the algorithm to compute
slices has $O(N^2|E|)$ time complexity, where $N$ is the number of processes
and $E$ is the set of events in the distributed system. The algorithm presented
in this paper has $O(|E|)$ complexity, which is optimal.

— Finally, we show that the results pertaining to consistent global
checkpoints can be derived as special cases of computation slicing.

Due to lack of space, the proofs of lemmas, theorems and corollaries, and
other details have been omitted. The interested reader can find them in the
technical report.

2 Model and Notation 

2.1 Lattices 

Given a lattice, we use $\sqcap$ and $\sqcup$ to denote its meet (infimum) and join
(supremum) operators, respectively. A lattice is distributive iff meet distributes
over join. Formally, $a \sqcap (b \sqcup c) = (a \sqcap b) \sqcup (a \sqcap c)$.

2.2 Directed Graphs: Path- and Cut- Equivalence 

Traditionally, a distributed computation is modeled by a partial order on a set of 
events. We use directed graphs to model both distributed computation and slice. 
Directed graphs allow us to handle both of them in a convenient and uniform 
manner. 

Given a directed graph $G$, let $V(G)$ and $E(G)$ denote its set of vertices and
edges, respectively. A subset of vertices of a directed graph forms a consistent cut
iff the subset contains a vertex only if it contains all its incoming neighbours.
Formally,

$C$ is a consistent cut of $G$ $\triangleq$ $\langle \forall e, f \in V(G) : (e, f) \in E(G) : f \in C \Rightarrow e \in C \rangle$

Observe that a consistent cut either contains all vertices in a cycle or none of 
them. This observation can be generalized to a strongly connected component. 
Traditionally, the notion of consistent cut (down-set or order ideal) is defined
for partially ordered sets. Here, we extend the notion to sets with arbitrary
orders. Let $C(G)$ denote the set of consistent cuts of a directed graph $G$. Observe
that the empty set $\emptyset$ and the set of vertices $V(G)$ trivially belong to $C(G)$. We
call them trivial consistent cuts. The following theorem is a slight generalization
of the result in lattice theory that the set of down-sets of a partially ordered set
forms a distributive lattice.

Theorem 1. Given a directed graph $G$, $\langle C(G); \subseteq \rangle$ forms a distributive lattice.

The theorem follows from the fact that, given two consistent cuts of a graph, 
the cuts given by their set intersection and set union are also consistent. 
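As a small illustration of this closure property, the following Python sketch
enumerates the consistent cuts of a toy directed graph by brute force and checks
that they are closed under set union and set intersection. The graph and the
helper name `consistent_cuts` are assumptions made for this example; they do not
come from the paper.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


# Hypothetical graph: b and c lie on a cycle, so a cut holds both or neither.
vertices = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c"), ("c", "b"), ("a", "d")}

cuts = consistent_cuts(vertices, edges)
# Closure under union and intersection makes the cuts a sublattice of the
# power-set lattice, which is distributive.
closed = all((x | y) in cuts and (x & y) in cuts for x in cuts for y in cuts)
```

The enumeration is exponential in the number of vertices, so it is only meant
for checking the definitions on very small graphs.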

A directed graph $G$ is cut-equivalent to a directed graph $H$ iff they have the
same set of consistent cuts, that is, $C(G) = C(H)$. Let $P(G)$ denote the set of
pairs of vertices $(u, v)$ such that there is a path from $u$ to $v$ in $G$. We assume
that each vertex has a path to itself. A directed graph $G$ is path-equivalent to a
directed graph $H$ iff a path from vertex $u$ to vertex $v$ in $G$ implies a path from
vertex $u$ to vertex $v$ in $H$ and vice versa, that is, $P(G) = P(H)$.

Lemma 1. Let $G$ and $H$ be directed graphs on the same set of vertices. Then,

$P(G) \subseteq P(H) \equiv C(G) \supseteq C(H)$

Lemma 1 implies that two directed graphs are cut-equivalent iff they are
path-equivalent. This is significant because path-equivalence can be verified in
polynomial time ($|P(G)| = O(|V(G)|^2)$) as compared to cut-equivalence, which
is computationally expensive to ascertain in general ($|C(G)| = O(2^{|V(G)|})$).
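The relationship between paths and cuts can be checked on small graphs. The
sketch below computes $P(G)$ as a reflexive-transitive closure and compares it
with a brute-force enumeration of $C(G)$. The three graphs and the helper names
`paths` and `consistent_cuts` are hypothetical and chosen only for this example.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


def paths(vertices, edges):
    """Pairs (u, v) with a path from u to v; every vertex reaches itself."""
    reach = {(v, v) for v in vertices} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach


vertices = {"a", "b", "c"}
g = {("a", "b"), ("b", "c"), ("a", "c")}   # chain plus a redundant edge
h = {("a", "b"), ("b", "c")}               # same paths as g, fewer edges
k = {("a", "b")}                           # strictly fewer paths than g

same_paths = paths(vertices, g) == paths(vertices, h)
same_cuts = consistent_cuts(vertices, g) == consistent_cuts(vertices, h)
fewer_paths = paths(vertices, k) <= paths(vertices, g)
more_cuts = consistent_cuts(vertices, k) >= consistent_cuts(vertices, g)
```

Here `g` and `h` are path-equivalent and therefore cut-equivalent, while `k` has
fewer paths and correspondingly more consistent cuts.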

2.3 Distributed Computations as Directed Graphs 

We assume an asynchronous distributed system with the set of processes
$P = \{P_1, P_2, \ldots, P_N\}$. Processes communicate and synchronize with each other
by sending messages over a set of reliable channels.

A local computation of a process is described by a sequence of events that 
transforms the initial state of the process into the final state. At each step, the 
local state of a process is captured by the initial state and the sequence of events 
that have been executed up to that step. Each event is a send event, a receive 
event, or an internal event. An event causes the local state of a process to be 
updated. Additionally, a send event causes a message to be sent and a receive 
event causes a message to be received. We assume the presence of fictitious
initial and final events on each process $P_i$, denoted by $\bot_i$ and $\top_i$, respectively.
The initial event occurs before any other event on the process and initializes the
state of the process. The final event occurs after all other events on the process.

Let $proc(e)$ denote the process on which event $e$ occurs. The predecessor and
successor events of $e$ on $proc(e)$ are denoted by $pred(e)$ and $succ(e)$, respectively,
if they exist. We denote the order of events on process $P_i$ by $\prec_{P_i}$. Let $\prec_P$ be
the union of all $\prec_{P_i}$, $1 \leq i \leq N$, and let $\preceq_P$ denote the reflexive closure of $\prec_P$.

We model a distributed computation (or simply a computation), denoted by
$(E, \rightarrow)$, as a directed graph with vertices as the set of events $E$ and edges as $\rightarrow$.
To limit our attention to only those consistent cuts that can actually occur during
an execution, we assume that, for any computation $(E, \rightarrow)$, $P((E, \rightarrow))$ contains
at least Lamport's happened-before relation. We assume that the set of
all initial events belongs to the same strongly connected component. Similarly,
the set of all final events belongs to the same strongly connected component.
This ensures that any non-trivial consistent cut will contain all initial events
and none of the final events. As a result, every consistent cut of a computation
in the traditional model is a non-trivial consistent cut of the computation in our
model and vice versa. Only non-trivial consistent cuts are of real interest to us.
We will see later that our model allows us to capture empty slices in a very
convenient fashion.

A distributed computation in our model can contain cycles. This is because 
whereas a computation in the happened-before model captures the observable 

Fig. 2. (a) A computation and (b) the lattice corresponding to its consistent cuts. 



order of execution of events, a computation in our model captures the set of 
possible consistent cuts. 

A frontier of a consistent cut is the set of those events of the cut whose 
successors, if they exist, are not contained in the cut. Formally, 

$frontier(C) = \{ e \in C \mid succ(e) \text{ exists} \Rightarrow succ(e) \notin C \}$

A consistent cut is uniquely characterized by its frontier and vice versa. Thus 
sometimes, especially in figures, we specify a consistent cut by simply listing 
the events in its frontier instead of enumerating all its events. Two events are 
said to be consistent iff they are contained in the frontier of some consistent 
cut, otherwise they are inconsistent. It can be verified that events $e$ and $f$ are
consistent iff there is no path in the computation from $succ(e)$, if it exists, to $f$
and from $succ(f)$, if it exists, to $e$. Also, note that, in our model, an event can
be inconsistent with itself. Fig. 2 depicts a computation and the lattice of its
(non-trivial) consistent cuts. A consistent cut in the figure is represented by its
frontier. For example, the consistent cut $D$ is represented by $\{e_2, f_1\}$.
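The frontier definition translates directly into code. In the minimal sketch
below, event $k$ of process $p$ is encoded as the pair `(p, k)`, so that the
successor of `(p, k)` is `(p, k + 1)` when it exists. This encoding and the
function name `frontier` are assumptions made for the example.

```python
def frontier(cut, num_events):
    """Events of the cut whose successor on the same process is absent from the cut."""
    result = set()
    for (p, k) in cut:
        has_successor = k + 1 < num_events[p]
        if not has_successor or (p, k + 1) not in cut:
            result.add((p, k))
    return result


num_events = {1: 3, 2: 2}            # process id -> number of events on it
cut = {(1, 0), (1, 1), (2, 0)}       # a consistent cut given as a set of events
front = frontier(cut, num_events)
```

For the cut above, the frontier consists of the last included event of each
process, namely `(1, 1)` and `(2, 0)`.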

2.4 Global Predicates 

A global predicate (or simply a predicate) is a boolean-valued function defined 
on variables of processes. It is evaluated on events in the frontier of a consistent 
cut. Some examples are mutual exclusion and “at least one philosopher does not
have any fork”. We leave the predicate undefined for the trivial consistent cuts.
A global predicate is local iff it depends on variables of at most one process. For
example, “$P_i$ is in red state” and “$P_i$ does not have the token”.



3 Slicing a Distributed Computation 

In this section, we define the notion of slice of a computation with respect to a 
predicate. The definition given here is weaker than the definition given in our 
Fig. 3. (a) The sublattice of the lattice in Fig. 2(b) with respect to the predicate
$((x < 2) \wedge (y > 1)) \vee (x < 1)$, and (b) the corresponding slice.



earlier paper. However, a slice now exists with respect to every predicate (not
just specific predicates).

Definition 1 (Slice). A slice of a computation with respect to a predicate is the
smallest directed graph (with minimum number of consistent cuts) that contains
all consistent cuts of the original computation that satisfy the predicate.

We will later show that the smallest computation is well-defined for every
predicate. A slice of computation $(E, \rightarrow)$ with respect to a predicate $b$ is denoted
by $(E, \rightarrow)_b$. Note that $(E, \rightarrow) = (E, \rightarrow)_{true}$. In the rest of the paper, we use the
terms “computation”, “slice” and “directed graph” interchangeably.

Fig. 3(a) depicts the set of consistent cuts of the computation in Fig. 2(a)
that satisfy the predicate $((x < 2) \wedge (y > 1)) \vee (x < 1)$. The cut shown with
dashed outline does not actually satisfy the predicate but has to be included
to complete the sublattice. Fig. 3(b) depicts the slice of the computation with
respect to the predicate. In the figure, all events in a subset belong to the same
strongly connected component.

In our model, every slice derived from the computation $(E, \rightarrow)$ will have the
trivial consistent cuts ($\emptyset$ and $E$) among its set of consistent cuts. Consequently,
a slice is empty iff it has no non-trivial consistent cuts. In the rest of the paper,
unless otherwise stated, a consistent cut refers to a non-trivial consistent cut.

A slice of a computation with respect to a predicate is lean iff every consistent 
cut of the slice satisfies the predicate. 
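To make Definition 1 concrete, the following brute-force sketch builds a slice
for a tiny computation. It draws an edge from $e$ to $f$ whenever every
satisfying cut that contains $f$ also contains $e$, and then lists the
consistent cuts of the resulting graph. The computation, the predicate and the
names `slice_edges` and `consistent_cuts` are hypothetical; the vertices `bot`
and `top` stand for the collapsed initial and final events. The construction
is meant for illustration only and is exponential in the number of events.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


def slice_edges(vertices, satisfying):
    """Edge (e, f) iff every satisfying cut that contains f also contains e."""
    return {(e, f) for e in vertices for f in vertices
            if all(e in cut for cut in satisfying if f in cut)}


# Two independent processes: bot -> e1 -> e2 -> top and bot -> f1 -> f2 -> top.
vertices = {"bot", "e1", "e2", "f1", "f2", "top"}
edges = {("bot", "e1"), ("e1", "e2"), ("e2", "top"),
         ("bot", "f1"), ("f1", "f2"), ("f2", "top")}


def predicate(cut):
    """Hypothetical non-regular predicate: exactly one process took one step."""
    return cut in (frozenset({"bot", "e1"}), frozenset({"bot", "f1"}))


all_cuts = consistent_cuts(vertices, edges)
satisfying = {c for c in all_cuts if predicate(c)}
slice_cuts = consistent_cuts(vertices, slice_edges(vertices, satisfying))
```

In this example the slice is not lean: besides the two satisfying cuts and the
trivial cuts, it also contains their intersection and union, which do not
satisfy the predicate.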

4 Regular Predicates 

A global predicate is regular iff the set of consistent cuts that satisfy the predicate
forms a sublattice of the lattice of consistent cuts. Equivalently, if two
consistent cuts satisfy a regular predicate then the cuts given by their set intersection
and set union will also satisfy the predicate. Some examples of regular
predicates are any local predicate and channel predicates such as “there are at most
$k$ messages in transit from $P_i$ to $P_j$”. The class of regular predicates is closed
under conjunction. We prove elsewhere that the slice of a computation
with respect to a predicate is lean iff the predicate is regular. We next show how
slicing can be used to monitor predicates in distributed systems. Later, we use
the notion of regular predicates to prove that the slice exists and is well-defined
with respect to every predicate.



4.1 Using Slices to Monitor Regular Predicates 

A predicate can be monitored under four modalities, namely possibly, definitely,
invariant and controllable. A predicate is possibly true in a computation
iff there is a consistent cut of the computation that satisfies the predicate.
On the other hand, a predicate definitely holds in a computation iff it eventually
becomes true in all runs of the computation (a run is a path in the lattice of
consistent cuts). The predicates $invariant : b$ and $controllable : b$ are duals of
predicates $possibly : \neg b$ and $definitely : \neg b$, respectively. Predicate detection normally
involves detecting a predicate under possibly modality whereas predicate control
involves monitoring a predicate under controllable modality. Monitoring has
applications in the areas of testing and debugging and software fault-tolerance of
distributed programs.

The next theorem describes how $possibly : b$, $invariant : b$ and $controllable : b$
can be computed using the notion of slice when $b$ is a regular predicate. We do
not yet know the complexity of computing $definitely : b$ when $b$ is regular.

Theorem 2. A regular predicate is 

1. possibly true in a computation iff the slice of the computation with respect 
to the predicate has at least one non-trivial consistent cut, that is, it has at 
least two strongly connected components. 

2. invariant in a computation iff the slice of the computation with respect to 
the predicate is cut-equivalent to the computation.

3. controllable in a computation iff the slice of the computation with respect 
to the predicate has the same number of strongly connected components as 
the computation. 



Observe that the first proposition holds for any arbitrary predicate. Since
detecting whether a predicate possibly holds in a computation is NP-complete in
general, it is, in general, NP-complete to determine whether a predicate
has a non-empty slice.



4.2 Regularizing a Non-regular Predicate 

In this section, we show that a slice exists and is well-defined with respect to every
predicate. We know that this is true for at least regular predicates. In addition,
the slice with respect to a regular predicate is lean. We exploit these facts and
define a closure operator, denoted by $reg$, which, given a computation, converts an
arbitrary predicate into a regular predicate satisfying certain properties. Given
a computation, let $\mathcal{R}$ denote the set of predicates that are regular with respect
to the computation.







N. Mittal and V.K. Garg 



Definition 2 (reg). Given a predicate $b$, we define $reg(b)$ as the predicate that
satisfies the following conditions:

1. it is regular, that is, $reg(b) \in \mathcal{R}$,

2. it is weaker than $b$, that is, $b \Rightarrow reg(b)$, and

3. it is stronger than any other predicate that satisfies 1 and 2, that is,

$\langle \forall b' : b' \in \mathcal{R} : (b \Rightarrow b') \Rightarrow (reg(b) \Rightarrow b') \rangle$

Informally, $reg(b)$ is the strongest regular predicate weaker than $b$. In general,
$reg(b)$ not only depends on the predicate $b$ but also on the computation under
consideration. We assume the dependence on computation to be implicit and
make it explicit only when necessary. The next theorem establishes that $reg(b)$
exists for every predicate. Observe that the slice for $b$ is given by the slice for
$reg(b)$. Thus a slice exists and is well-defined for all predicates.

Theorem 3. Given a predicate $b$, $reg(b)$ exists and is well-defined.

Thus, given a computation $(E, \rightarrow)$ and a predicate $b$, the slice of $(E, \rightarrow)$ with
respect to $b$ can be obtained by first applying the $reg$ operator to $b$ to get $reg(b)$
and then computing the slice of $(E, \rightarrow)$ with respect to $reg(b)$.

Theorem 4. $reg$ is a closure operator. Formally,

1. $reg(b)$ is weaker than $b$, that is, $b \Rightarrow reg(b)$,

2. $reg$ is monotonic, that is, $(b \Rightarrow b') \Rightarrow (reg(b) \Rightarrow reg(b'))$, and

3. $reg$ is idempotent, that is, $reg(reg(b)) = reg(b)$.

From the above theorem it follows (Theorem 2.21 of the cited reference) that

Corollary 1. $\langle \mathcal{R}; \Rightarrow \rangle$ forms a lattice.

The meet and join of two regular predicates $b_1$ and $b_2$ are given by

$b_1 \sqcap b_2 = b_1 \wedge b_2$
$b_1 \sqcup b_2 = reg(b_1 \vee b_2)$

The dual notion of $reg(b)$, the weakest regular predicate stronger than $b$, is
conceivable. However, such a predicate may not always be unique.
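The closure properties of $reg$ can be checked by brute force if a predicate is
identified with the set of consistent cuts that satisfy it, so that implication
becomes set inclusion. Under that assumption, $reg(b)$ is obtained by closing the
satisfying cuts under union and intersection. The following sketch uses
hypothetical cuts and a hypothetical function name `reg`; it illustrates the
definition and is not the algorithm used in the paper.

```python
def reg(satisfying):
    """Smallest family containing `satisfying` and closed under union and intersection."""
    closed = set(satisfying)
    changed = True
    while changed:
        changed = False
        for x in list(closed):
            for y in list(closed):
                for z in (x | y, x & y):
                    if z not in closed:
                        closed.add(z)
                        changed = True
    return closed


# Hypothetical satisfying cuts of two predicates, with b implying b_weaker.
cut_e = frozenset({"bot", "e1"})
cut_f = frozenset({"bot", "f1"})
b = {cut_e, cut_f}
b_weaker = {cut_e, cut_f, frozenset({"bot", "e1", "e2"})}

extensive = b <= reg(b)                    # b implies reg(b)
monotonic = reg(b) <= reg(b_weaker)        # b implies b' gives reg(b) implies reg(b')
idempotent = reg(reg(b)) == reg(b)         # reg(reg(b)) equals reg(b)
```

Because the union and intersection of two consistent cuts are again consistent
cuts, the closure never leaves the lattice of consistent cuts.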

5 Representing a Slice 

Observe that any directed graph that is cut-equivalent or path-equivalent to a 
slice constitutes its valid representation. However, for computational purposes, 
it is preferable to select those graphs to represent a slice that have fewer edges 
and can be constructed cheaply. In this section, we show that every slice can
be represented by a directed graph with $O(|E|)$ vertices and $O(N|E|)$ edges.
Furthermore, the graph can be built in $O(N^2|E|)$ time.
Fig. 4. The skeletal representation of the slice in Fig. 3(b) (without self-loops).



Given a computation $(E, \rightarrow)$, a regular predicate $b$ and an event $e$, let
$J_b(e)$ denote the least consistent cut of $(E, \rightarrow)$ that contains $e$ and satisfies
$b$. If $J_b(e)$ does not exist then it is set to the trivial consistent cut $E$.
Here, we use $E$ as a sentinel cut. Fig. 4 depicts a directed graph that
represents the slice shown in Fig. 3(b). In the figure, $J_b(e_1) = \{\bot_1, e_1, \bot_2\}$ and
$J_b(f_2) = \{\bot_1, e_1, e_2, \top_1, \bot_2, f_1, f_2, \top_2\}$.

The cut $J_b(e)$ can also be viewed as the least consistent cut of the slice
$(E, \rightarrow)_b$ that contains the event $e$. The results in our earlier work establish that it is sufficient
to know $J_b(e)$ for each event $e$ in order to recover the slice. In particular, a
directed graph with $E$ as the set of vertices and an edge from an event $e$ to an
event $f$ iff $J_b(e) \subseteq J_b(f)$ is cut-equivalent to the slice $(E, \rightarrow)_b$. We also present
an $O(N^2|E|)$ algorithm to compute $J_b(e)$ for each event $e$. However, the graph
so obtained can have as many as $\Omega(|E|^2)$ edges.

Let $F_b(e, i)$ denote the earliest event $f$ on $P_i$ such that $J_b(e) \subseteq J_b(f)$.
Informally, $F_b(e, i)$ is the earliest event on $P_i$ that is reachable from $e$ in the slice
$(E, \rightarrow)_b$. For example, in Fig. 4, $F_b(e_1, 1) = e_1$ and $F_b(e_1, 2) = f_2$. Given $J_b(e)$
for each event $e$, $F_b(e, i)$ for each event $e$ and process $P_i$ can be computed in
$O(N|E|)$ time [12]. We now construct a directed graph that we call the skeletal
representation of the slice with respect to $b$ and denote it by $G_b$. The graph $G_b$
has $E$ as the set of vertices and the following edges: (1) for each event $e$ that
is not a final event, there is an edge from $e$ to $succ(e)$, and (2) for each event $e$
and process $P_i$, there is an edge from $e$ to $F_b(e, i)$.
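The construction of $G_b$ can be sketched by brute force on a toy example. The
code below takes the non-trivial satisfying cuts of a regular predicate as
given (a hypothetical two-process example in which event $k$ of process $p$ is
the pair `(p, k)`), computes $J_b(e)$ as the intersection of the satisfying cuts
that contain $e$, derives $F_b(e, i)$, and assembles the two kinds of edges. The
names `skeletal_edges` and `consistent_cuts` are assumptions, and the quadratic
search for $F_b(e, i)$ is for illustration only; it is not the $O(N|E|)$
procedure cited above.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


def skeletal_edges(num_events, satisfying):
    """Edges of G_b, given the non-trivial cuts that satisfy a regular predicate b."""
    events = [(p, k) for p in sorted(num_events) for k in range(num_events[p])]
    everything = frozenset(events)

    def J(e):
        # Least satisfying cut containing e; the sentinel cut E if none exists.
        containing = [c for c in satisfying if e in c]
        return frozenset.intersection(*containing) if containing else everything

    edges = set()
    for (p, k) in events:
        if k + 1 < num_events[p]:
            edges.add(((p, k), (p, k + 1)))          # (1) e -> succ(e)
        for q in sorted(num_events):
            # (2) e -> F_b(e, q): the earliest f on process q with J(e) <= J(f).
            f = next((q, j) for j in range(num_events[q]) if J((p, k)) <= J((q, j)))
            edges.add(((p, k), f))
    return edges


num_events = {1: 4, 2: 4}      # events 0 and 3 are the initial and final events
s1 = frozenset({(1, 0), (1, 1), (2, 0)})
s2 = s1 | {(2, 1), (2, 2)}
events = {(p, k) for p in num_events for k in range(num_events[p])}

g_b = skeletal_edges(num_events, [s1, s2])
cuts = consistent_cuts(events, g_b)
```

On this example the consistent cuts of the resulting graph are exactly the two
satisfying cuts together with the trivial cuts.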

The skeletal representation of the slice depicted in Fig. 3(b) is shown in Fig. 4.
To prove that the graph $G_b$ is actually cut-equivalent to the slice $(E, \rightarrow)_b$, it
suffices to show the following:

Theorem 5. For events $e$ and $f$, $J_b(e) \subseteq J_b(f) \equiv (e, f) \in P(G_b)$.

Besides having computational benefits, the skeletal representation of a slice 
can be used to devise a simple and efficient algorithm to graft two slices. 

6 Grafting Two Slices 

In this section, we present an algorithm to graft two slices, which can be done with
respect to meet or join. Informally, the former case corresponds to the smallest
slice that contains all consistent cuts common to both slices whereas the latter
case corresponds to the smallest slice that contains consistent cuts of both slices.
In other words, given slices $(E, \rightarrow)_{b_1}$ and $(E, \rightarrow)_{b_2}$, where $b_1$ and $b_2$ are regular
predicates, we provide an algorithm to compute the slice $(E, \rightarrow)_b$, where $b$ is either
$b_1 \sqcap b_2 = b_1 \wedge b_2$ or $b_1 \sqcup b_2 = reg(b_1 \vee b_2)$. Grafting enables us to compute the
slice for an arbitrary boolean expression of local predicates (by rewriting it in
DNF), although it may require exponential time in the worst case. Later, in this
section, we present an efficient algorithm based on grafting to compute the slice for
a co-regular predicate (complement of a regular predicate). We also show how
grafting can be used to avoid examining many consistent cuts when detecting a
predicate under possibly modality.

6.1 Grafting with Respect to Meet: $b = b_1 \sqcap b_2 = b_1 \wedge b_2$

In this case, the slice $(E, \rightarrow)_b$ contains a consistent cut of $(E, \rightarrow)$ iff the cut
satisfies $b_1$ as well as $b_2$. Let $F_{min}(e, i)$ denote the earlier of events $F_{b_1}(e, i)$
and $F_{b_2}(e, i)$, that is, $F_{min}(e, i) = \min\{F_{b_1}(e, i), F_{b_2}(e, i)\}$. The following lemma
establishes that, for each event $e$ and process $P_i$, $F_{min}(e, i)$ cannot occur before
$F_b(e, i)$.

Lemma 2. For each event $e$ and process $P_i$, $F_b(e, i) \preceq_P F_{min}(e, i)$.

We now construct a directed graph $G_{min}$ which is similar to $G_b$, the skeletal
representation for $(E, \rightarrow)_b$, except that we use $F_{min}(e, i)$ instead of $F_b(e, i)$ in its
construction. The next theorem proves that $G_{min}$ is cut-equivalent to $G_b$.

Theorem 6. $G_{min}$ is cut-equivalent to $G_b$.

Roughly speaking, the aforementioned algorithm computes the union of the
sets of edges of each slice. Note that, in general, $F_b(e, i)$ need not be the same as
$F_{min}(e, i)$ [12]. This algorithm can be generalized to the conjunction of an arbitrary
number of regular predicates.
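The edge-union intuition can be checked directly on small graphs: a set of
vertices respects the edges of both slices exactly when it is a consistent cut
of each of them. The sketch below uses two hypothetical slices given as plain
edge sets rather than skeletal representations; it illustrates the idea behind
grafting with respect to meet, not the $F_{min}$ construction itself.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


vertices = {"a", "b", "c"}
slice_1 = {("a", "b")}          # hypothetical slice for b1: b requires a
slice_2 = {("b", "c")}          # hypothetical slice for b2: c requires b

meet = slice_1 | slice_2        # graft with respect to meet: union of edges
meet_cuts = consistent_cuts(vertices, meet)
common = consistent_cuts(vertices, slice_1) & consistent_cuts(vertices, slice_2)
```

The consistent cuts of the grafted graph coincide with the cuts common to both
slices.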

6.2 Grafting with Respect to Join: $b = b_1 \sqcup b_2 = reg(b_1 \vee b_2)$

In this case, the slice $(E, \rightarrow)_b$ contains a consistent cut of $(E, \rightarrow)$ if the cut
satisfies either $b_1$ or $b_2$. The dual of the graph $G_{min}$ (min replaced by max), denoted
by $G_{max}$, (surprisingly) turns out to be cut-equivalent to the slice $(E, \rightarrow)_b$. As
before, let $F_{max}(e, i)$ denote the later of events $F_{b_1}(e, i)$ and $F_{b_2}(e, i)$, that is,
$F_{max}(e, i) = \max\{F_{b_1}(e, i), F_{b_2}(e, i)\}$. The following lemma establishes that, for
each event $e$ and process $P_i$, $F_b(e, i)$ cannot occur before $F_{max}(e, i)$.

Lemma 3. For each event $e$ and process $P_i$, $F_{max}(e, i) \preceq_P F_b(e, i)$.

We now construct a directed graph $G_{max}$ that is similar to $G_b$, the skeletal
representation for $(E, \rightarrow)_b$, except that we use $F_{max}(e, i)$ instead of $F_b(e, i)$ in
its construction. The next theorem proves that $G_{max}$ is cut-equivalent to $G_b$.

Theorem 7. $G_{max}$ is cut-equivalent to $G_b$.

Intuitively, the above-mentioned algorithm computes the intersection of the
sets of edges of each slice. In this case, in contrast to the former case, $F_b(e, i)$
is actually identical to $F_{max}(e, i)$. This algorithm can be generalized to the
disjunction of an arbitrary number of regular predicates.
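The edge-intersection intuition can also be checked by brute force. In the
sketch below, the intersection is taken over the path relations $P(\cdot)$ of the
two slices rather than over raw edge sets; this is an assumption made so that
the example does not depend on how each slice happens to be drawn. The result
is compared with the closure of the union of the two cut sets under union and
intersection. All graphs and helper names are hypothetical.

```python
from itertools import combinations


def consistent_cuts(vertices, edges):
    """Return every subset C such that (e, f) in edges and f in C imply e in C."""
    ordered = sorted(vertices)
    cuts = set()
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            cut = frozenset(subset)
            if all(e in cut for (e, f) in edges if f in cut):
                cuts.add(cut)
    return cuts


def paths(vertices, edges):
    """Pairs (u, v) with a path from u to v; every vertex reaches itself."""
    reach = {(v, v) for v in vertices} | set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach


def close(family):
    """Close a family of cuts under pairwise union and intersection."""
    closed = set(family)
    changed = True
    while changed:
        changed = False
        for x in list(closed):
            for y in list(closed):
                for z in (x | y, x & y):
                    if z not in closed:
                        closed.add(z)
                        changed = True
    return closed


vertices = {"a", "b", "c"}
slice_1 = {("a", "b"), ("b", "a"), ("a", "c")}   # a and b on a cycle, then c
slice_2 = {("a", "c"), ("c", "a"), ("a", "b")}   # a and c on a cycle, then b

join = paths(vertices, slice_1) & paths(vertices, slice_2)
join_cuts = consistent_cuts(vertices, join)
expected = close(consistent_cuts(vertices, slice_1) | consistent_cuts(vertices, slice_2))
```

Here the grafted slice contains the cut `{a}`, which belongs to neither input
slice but is needed to complete the sublattice.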
6.3 Applications of Grafting

Computing Slice for a Co-Regular Predicate. Given a regular predicate,
we give an algorithm to compute the slice of a computation with respect to
its negation, a co-regular predicate. In particular, we express the negation as
a disjunction of a polynomial number of regular predicates. The slice can then be
computed by grafting together the slices for each disjunct.

Let $(E, \rightarrow)$ be a computation and $(E, \rightarrow)_b$ be its slice with respect to a regular
predicate $b$. For convenience, let $\rightarrow_b$ be the edge relation for the slice. We assume
that both $\rightarrow$ and $\rightarrow_b$ are transitive relations. Our objective is to find a property
that distinguishes the consistent cuts that belong to the slice from the consistent
cuts that do not. Consider events $e$ and $f$ such that $e \not\rightarrow f$ but $e \rightarrow_b f$. Then,
clearly, a consistent cut that contains $f$ but does not contain $e$ cannot belong
to the slice. On the other hand, every consistent cut of the slice that contains $f$
also contains $e$. This motivates us to define a predicate $prevents(f, e)$ as follows:

$C$ satisfies $prevents(f, e)$ $\triangleq$ $(f \in C) \wedge (e \notin C)$

It can be proved that $prevents(f, e)$ is a regular predicate [12]. It turns out
that every consistent cut that does not belong to the slice satisfies $prevents(f, e)$
for some events $e$ and $f$ such that $(e \not\rightarrow f) \wedge (e \rightarrow_b f)$ holds. Formally,

Theorem 8. Let $C$ be a consistent cut of $(E, \rightarrow)$. Then,

$C$ satisfies $\neg b$ $\equiv$ $\langle \exists e, f : (e \rightarrow_b f) \wedge (e \not\rightarrow f) : C$ satisfies $prevents(f, e) \rangle$

Theorem 8 implies that $\neg b$ can be expressed as a disjunction of prevents predicates.



Pruning State Space for Predicate Detection. Detecting a predicate under
possibly modality is NP-complete in general. Using grafting, we can
reduce the search space for predicates composed from local predicates using
$\neg$, $\wedge$ and $\vee$ operators. We first transform the predicate into an equivalent predicate
in which $\neg$ is applied directly to the local predicates and never to more complex
expressions. Observe that the negation of a local predicate is also a local
predicate. We start by computing slices with respect to these local predicates. This
can be done because a local predicate is regular and hence the algorithm given
in our earlier paper can be used to compute the slice. We then recursively graft slices together,
with respect to the appropriate operator, working our way out from the local
predicates until we reach the whole predicate. This will give us a slice of the
computation (not necessarily the smallest) which contains all consistent cuts
of the computation that satisfy the predicate. In many cases, the slice obtained
will be much smaller than the computation itself, enabling us to ignore many
consistent cuts in our search.

For example, suppose we wish to compute the slice of a computation with
respect to the predicate $(x_1 \vee x_2) \wedge (x_3 \vee x_4)$, where $x_i$ is a boolean variable
on process $p_i$. As explained, we first compute slices for the local predicates $x_1$,
$x_2$, $x_3$ and $x_4$. We then graft the first two and the last two slices together with
respect to join to obtain slices for the clauses $x_1 \vee x_2$ and $x_3 \vee x_4$, respectively.
Finally, we graft the slices for both clauses together with respect to meet to get
the slice for the predicate $reg(x_1 \vee x_2) \wedge reg(x_3 \vee x_4)$ which, in general, is larger
than the slice for the predicate $(x_1 \vee x_2) \wedge (x_3 \vee x_4)$ but much smaller than the
computation itself.

Fig. 5. An optimal algorithm to compute the slice for a conjunctive predicate.

The result for co-regular predicates allows us to generalize this approach to
predicates composed from arbitrary regular predicates using $\neg$, $\wedge$ and $\vee$ operators. We
plan to conduct experiments to quantitatively evaluate the effectiveness of our
approach. Although our focus is on detecting predicates under possibly modality,
slicing can be used to prune the search space for monitoring predicates under other
modalities too.

7 Optimal Algorithm for Slicing 

The algorithm we presented in our earlier paper to compute slices for regular predicates has
$O(N^2|E|)$ time complexity, where $N$ is the number of processes and $E$ is the
set of events. In this section we present an optimal algorithm for computing
slices for special cases of regular predicates. Our algorithm has $O(|E|)$
time complexity. Due to lack of space, only the optimal algorithm for conjunctive
predicates is presented. The optimal algorithm for other regular predicates such
as channel predicates can be found elsewhere.

A conjunctive predicate is a conjunction of local predicates. For example, “$P_1$
is in red state” $\wedge$ “$P_2$ is in green state” $\wedge$ “$P_3$ is in blue state”. Given a set of
local predicates, one for each process, we can categorize events on each process 
into true events and false events. An event is a true event iff the corresponding 
local predicate evaluates to true, otherwise it is a false event. 

To compute the slice of a computation for a conjunctive predicate, we con- 
struct a directed graph with vertices as events in the computation and the follow- 
ing edges: (1) from an event, that is not a final event, to its successor, (2) from 
a send event to the corresponding receive event, and (3) from the successor of a 
false event to the false event. 

For the purpose of building the graph, we assume that all final events are true 
events. Thus every false event has a successor. The first two kinds of edges
ensure that Lamport’s happened-before relation is captured in the graph. The
algorithm is illustrated in Fig. 5. In the figure, all true events have been
encircled.

It can be proved that the directed graph obtained is cut-equivalent to the slice
of the computation with respect to the given conjunctive predicate b. It is easy
to see that the graph has O(|E|) vertices and O(|E|) edges (at most three edges
per event, assuming that an event that is not local either sends at most one
message or receives at most one message but not both) and can be built in O(|E|)
time. The slice can be computed by finding the strongly connected components
of the graph. Thus the algorithm has O(|E|) overall time complexity. It also
gives us an O(|E|) algorithm to evaluate possibly: b when b is a conjunctive
predicate (see Theorem 0).
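
The construction above can be sketched in a few lines of Python. This is only
an illustrative sketch, not the authors' implementation: the event encoding
(a pair of process index and position), the function names, and the choice of
Kosaraju's algorithm for the strongly connected components are assumptions
made here for the example.

```python
from collections import defaultdict


def build_slice_graph(truth, messages):
    """Build the directed graph for a conjunctive predicate.

    truth[p][i] is True iff event i on process p is a true event
    (hypothetical encoding). messages is a list of (send, receive) pairs,
    each event being a (process, index) tuple.
    """
    edges = defaultdict(list)
    for p, events in enumerate(truth):
        for i in range(len(events) - 1):
            edges[(p, i)].append((p, i + 1))      # (1) event -> successor
            if not events[i]:
                edges[(p, i + 1)].append((p, i))  # (3) successor -> false event
    for send, recv in messages:
        edges[send].append(recv)                  # (2) send -> receive
    # Final events are never tested above, so they are treated as true events.
    vertices = [(p, i) for p, ev in enumerate(truth) for i in range(len(ev))]
    return vertices, edges


def strongly_connected_components(vertices, edges):
    """Kosaraju's algorithm, written iteratively; linear in vertices + edges."""
    order, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(edges.get(root, ())))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(edges.get(w, ()))))
                    break
            else:
                order.append(v)  # v is finished
                stack.pop()
    reverse = defaultdict(list)
    for v, ws in edges.items():
        for w in ws:
            reverse[w].append(v)
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = set(), [root]
        while todo:
            v = todo.pop()
            component.add(v)
            for w in reverse.get(v, ()):
                if w not in assigned:
                    assigned.add(w)
                    todo.append(w)
        components.append(frozenset(component))
    return components
```

Each false event ends up in the same component as its successor, so no
consistent cut of the resulting graph can have a false event on its frontier.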

By defining a local predicate (evaluated on an event) to be true iff the event
corresponds to a local checkpoint, it can be verified that there is a zigzag path
from a local checkpoint c to a local checkpoint c' in a computation iff there
is a path from succ(c), if it exists, to c' in the corresponding slice, which can
be ascertained by comparing J_b(succ(c)) and J_b(c'). An alternative formulation
of the consistency theorem can thus be obtained as follows:

Theorem 9. A set of local checkpoints can belong to the same consistent global 
snapshot iff the local checkpoints in the set are mutually consistent (including 
with itself) in the corresponding slice. 

Moreover, the R-graph (rollback-dependency graph) is path-equivalent
to the slice when each contiguous sequence of false events on a process is merged
with the nearest true event that occurs later on the process. The minimum
consistent global checkpoint that contains a set of local checkpoints can be
computed by taking the set union of the J_b's for each local checkpoint in the
set. The maximum consistent global checkpoint can be similarly obtained by using
the dual of J_b.

8 Conclusion and Future Work 

In this paper, the notion of slice introduced in our earlier paper is generalized 
and its existence for all global predicates is established. The intractability of 
computing the slice, in general, is also proved. An optimal algorithm to compute 
slices for special cases of predicates is provided. Moreover, an efficient algorithm 
to graft two slices is also given. Application of slicing in general and grafting 
in particular to global property evaluation of distributed programs is discussed. 
Finally, the results pertaining to consistent global checkpoints are shown
to be special cases of computation slicing.

As future work, we plan to study grafting in greater detail. Specifically, we 
plan to conduct experiments to quantitatively evaluate its effectiveness in weed- 
ing out unnecessary consistent cuts from examination during state space search 
for predicate detection. Another direction for future research is to extend the 
notion of slicing to include temporal predicates.

Abstract. Atomic commitment is one of the key functionalities of modern
information systems. Conventional distributed databases, transaction
processing monitors, or distributed object platforms are examples
of complex systems built around atomic commitment. The vast majority 
of such products implement atomic commitment using some variation of 
2 Phase Commit (2PC) although 2PC may block under certain condi- 
tions. The alternative would be to use non-blocking protocols but these 
are seen as too heavy and slow. In this paper we propose a non-blocking 
distributed commit protocol that exhibits the same latency as 2PC. The 
protocol combines several ideas (optimism and replication) to implement 
a scalable solution that can be used in a wide range of applications. 



1 Introduction 



Atomic commitment (AC) protocols are used to implement atomic transactions.
Two-phase commit (2PC) is the most widely used AC protocol although its
blocking behavior is well known. There are also non-blocking protocols but
they have an inherently higher cost, which usually takes the form of either an
explicit extra round of messages (3 phase commit (3PC) [Ske81, Ske82]) or
an implicit one (when using uniform multicast).

The reason why 2PC is the standard protocol for atomic commitment is that 
transactional systems pay as much attention to performance as they do to consis- 
tency. For instance, most systems summarily abort those transactions that have 
not committed after a given period of time so that they do not keep resources 
locked. Existing non-blocking protocols resolve the consistency problem by in- 
creasing the latency and, therefore, are not practical. A realistic non-blocking 
alternative to 2PC needs to consider both consistency and transaction latency. 
Ideally, the non-blocking protocol should have the same latency as 2PC. Our 
goal is to implement such a protocol by addressing the three main sources of de- 
lay in atomic commitment: message overhead, forced writes to the log, and the 
convoy effect caused by transactions waiting for other transactions to commit. 

* This research has been partially funded by the Spanish National Research Council 
CICYT under grant TIC98-1032-C03-01.

To obtain non-blocking behavior, it is enough for the coordinator to use a
virtual synchronous uniform multicast protocol to propagate the outcome of the
transaction. This guarantees that either all or none of the participants know
about the fate of the transaction. Uniformity ensures the property holds
for any participant, even if it crashes during the multicast. Unfortunately,
uniformity is very expensive in terms of the delay it introduces. In addition,
since the delay depends on the size of the group, using uniformity seriously
compromises the scalability of the protocol. To solve these two limitations, we
use two different strategies. First, to increase the scalability, uniform
multicast is used only within a small group of processes (the commit servers)
instead of using it among all participants in the protocol. The idea is to
employ a hierarchical configuration where a small set of processes run the
protocol on behalf of a larger set of participants. Second, to minimize the
latency caused by uniformity, we resort to a novel technique based on optimistic
delivery that overlaps the processing of the transactional commit with the
uniform delivery of the multicast. The idea here is to hide the latency of
multicast behind operations that need to be performed anyway. This is
accomplished by processing messages in an optimistic manner and hoping that most
decisions will be correct although in some cases transactions might need to be
aborted. This approach builds upon recent work in optimistic multicast and a
more aggressive version of optimistic delivery proposed in the context of
Postgres-R [KPAS99] and later used to provide high performance eager replication
in clusters. We use an optimistic uniform multicast that delivers messages in
two steps. In the first step messages are delivered optimistically as soon as
they are received. In the second step messages are delivered uniformly when they
become stable. This optimistic uniform multicast is equivalent to a uniform
multicast with safe indications.



Forced writes to the log are another source of inefficiencies in AC protocols. 
To guarantee correctness in case of failures, participants must flush to disk a log 
entry before sending their vote. This log entry contains all the information needed 
by a participant to recall its own actions in the event of a crash. The coordinator 
is also required to flush the outcome of the protocol before communicating the 
decision to the participants (this log entry can be skipped by using the so-called
presume commit or presume abort protocols [MLO86]). Flushing log records adds
to the overall latency as messages cannot be sent or responded to before writing
to the log. In the protocol we propose, this delay is reduced by allowing sites 
to send messages instead of flushing log records. The idea is to use the main 
memory of a replicated group (the commit servers mentioned above) as stable 
memory instead of using a mirrored log with careful writes. 



Finally, to minimize the waiting time of transactions, in our protocol locks 
are released optimistically. The idea is that a transaction can be optimistically 
committed pending the confirmation provided by the uniform multicast. By op- 
timistically committing the transaction, other transactions can proceed although 
they risk a rollback if the transaction that was optimistically committed ended 
up aborting. In our protocol, the optimistic commit is performed in such a way 
that aborts are confined to a single level. In addition, transactions are only
optimistically committed when all their participants have voted affirmatively,
thereby greatly reducing the risk of having to abort the transaction. This
contrasts with other optimistic commit protocols, where transactions
that must abort (because one or more participants voted abort) can be
optimistically committed although they will roll back anyway, producing
unnecessary cascading aborts.

With these properties the protocol we propose satisfactorily addresses all 
design concerns related to non-blocking AC and can thus become an important 
contribution to future distributed applications. The paper is organized as
follows. Section 2 describes the system model, Section 3 presents the commit
algorithm, and the subsequent sections establish its correctness and conclude
the paper.

2 Model 

2.1 Communication Model 

The system consists of a set of fail-crash processes connected through reliable 
channels. Communication is asynchronous and by exchanging messages. A failed 
process can later recover with its permanent storage intact and re-join the
system. Failures are detected using a (possibly unreliable) failure detector¹
[CT96].

A virtual synchronous multicast service [BSS91, Bir96] is used. This
service delivers multicast messages and views. Views indicate which processes
are perceived as up and connected. We assume a virtual synchrony with the
following properties: (1) strong virtual synchrony, or sending view delivery,
which ensures that messages are delivered in the same view they were sent;
(2) majority or primary component views, which ensure that only members
within a majority view can progress, while the rest of the members block until
they rejoin the majority view; (3) liveness: when a member fails or is
partitioned from the majority view, a view excluding the failed member will
eventually be delivered.

The protocol uses two different multicast primitives [HT93, SS93]: reliable
(rel-multicast) and uniform (uni-multicast) multicast. Three primitives define
optimistic uniform reliable multicast²: Uni-multicast(m, g) multicasts message
m to a group g. Opt-deliver(m) delivers m reliably to the application.
Uni-deliver(m) delivers m to the application uniformly. We say that a process is
v_i-correct in a given view v_i if it does not fail in v_i and, if v_(i+1)
exists, it transits to it. The rel- and uni-multicasts preserve the following
properties, where m is a message, g a group of processes and v_i a view within
this group:

OM-Validity: If a correct process rel- or uni-multicasts m to g in v_i, m will
eventually be opt-delivered by every v_i-correct process.

OM-Agreement: If a v_i-correct process opt-delivers m in v_i, every v_i-correct
process will eventually opt-deliver m.

¹ The failure detector must allow the implementation of the virtual synchrony
model described below and of non-blocking atomic commitment (e.g., the one in
[GLS95]).

² Senders are not required to belong to the target group.

OM-Integrity: Any message is opt-delivered and uni-delivered by a process at
most once. A message is opt-delivered only if it has been previously multicast.

Uni-multicast additionally fulfills the following properties:

OM-Uniform-Agreement: If a majority of v_i-correct processes opt-deliver m,
they will eventually uni-deliver m. If a process uni-delivers m in v_i, every
v_i-correct process will eventually uni-deliver m.

OM-Uniform-Integrity: A message is uni-delivered in v_i only if it was
previously opt-delivered by a majority of processes in v_i.
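
The two-step delivery behind these properties can be illustrated with a small
bookkeeping sketch. It is not the multicast implementation assumed by the
paper: the class name, the explicit acknowledgement messages, and the rule
"stable once a majority of the view has acknowledged" are assumptions chosen
only to make OM-Integrity and OM-Uniform-Integrity concrete for a single view.

```python
class OptimisticUniformDelivery:
    """Delivery bookkeeping at one process for one view (hypothetical sketch)."""

    def __init__(self, view):
        self.view = frozenset(view)
        self.majority = len(self.view) // 2 + 1
        self.acks = {}           # msg_id -> members known to have opt-delivered it
        self.opt_delivered = []  # local optimistic delivery order
        self.uni_delivered = []  # local uniform delivery order

    def on_receive(self, msg_id, me):
        """Opt-deliver a message on first receipt; duplicates are ignored."""
        if msg_id in self.opt_delivered:
            return False         # OM-Integrity: at most once
        self.opt_delivered.append(msg_id)
        self.on_ack(msg_id, me)  # the local opt-delivery counts towards stability
        return True

    def on_ack(self, msg_id, sender):
        """Record that `sender` opt-delivered msg_id; uni-deliver once stable."""
        if sender not in self.view:
            return               # only view members count towards the majority
        self.acks.setdefault(msg_id, set()).add(sender)
        stable = len(self.acks[msg_id]) >= self.majority
        if (stable and msg_id in self.opt_delivered
                and msg_id not in self.uni_delivered):
            self.uni_delivered.append(msg_id)
```

In a view of three processes, a message is opt-delivered as soon as it arrives
and uni-delivered only after a second member is known to have opt-delivered it.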



2.2 Transaction Model 

Clients interact with the database by issuing transactions. A transaction is a 
partially ordered set of read and write operations followed by either a commit 
or an abort operation. The decision whether to commit or abort is made after 
executing an optimistic atomic commitment protocol. The protocol can decide 
(1) to immediately abort the transaction, (2) to perform an optimistic commit, 
or (3) decide to commit or abort. 

We assume a distributed database with n sites. Each site i has a transaction 
manager process TMi. When a client submits a transaction to the system, it 
chooses a site as its local site. The local TMi decides which other sites should get 
involved in processing the transaction and initiates the commitment protocol. 

We assume the database uses standard mechanisms like strict 2 phase locking
(2PL) to enforce serializability. The only change over known protocols
is introduced during the commit phase of a transaction. When a transaction t
is optimistically committed all its write locks are changed to opt locks and all
its read locks are released³. Opt locks are compatible with all other types of
locks. That is, other transactions are allowed to set compatible locks on data 
held under an opt lock. Such transactions are said to be on-hold, while the rest 
of transactions are said to be on normal status. When the outcome of t is finally 
determined, its opt locks are released and all transactions that were on-hold 
due to these opt locks are returned to their normal state. A transaction that is 
on-hold cannot enter the commit phase until it returns to the normal state. 
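
The lock handling described above can be sketched as follows. The class and
method names are hypothetical, and the sketch deliberately leaves out waiting
queues, deadlock handling, and the rollback of on-hold transactions when the
optimistically committed transaction finally aborts; it only shows how opt
locks stay compatible with other locks while marking dependants as on-hold.

```python
class LockTable:
    """Strict 2PL lock table extended with opt locks (illustrative sketch)."""

    def __init__(self):
        self.held = {}     # item -> {tid: "read" | "write" | "opt"}
        self.on_hold = {}  # tid -> opt-committed tids whose data it touched

    def acquire(self, tid, item, mode):
        """Return True if the lock is granted, False if the caller must wait."""
        owners = self.held.setdefault(item, {})
        for other, other_mode in owners.items():
            if other == tid or other_mode == "opt":
                continue   # opt locks are compatible with every lock mode
            if mode == "write" or other_mode == "write":
                return False  # ordinary read/write or write/write conflict
        for other, other_mode in owners.items():
            if other != tid and other_mode == "opt":
                self.on_hold.setdefault(tid, set()).add(other)
        if owners.get(tid) != "write":  # never downgrade an existing write lock
            owners[tid] = mode
        return True

    def opt_commit(self, tid):
        """Write locks become opt locks; read locks are released."""
        for owners in self.held.values():
            if owners.get(tid) == "write":
                owners[tid] = "opt"
            elif owners.get(tid) == "read":
                del owners[tid]

    def finish(self, tid):
        """Outcome of tid is final: drop its locks and free its dependants."""
        for owners in self.held.values():
            owners.pop(tid, None)
        for waiting in list(self.on_hold):
            self.on_hold[waiting].discard(tid)
            if not self.on_hold[waiting]:
                del self.on_hold[waiting]  # back to normal status

    def can_enter_commit(self, tid):
        return tid not in self.on_hold
```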



2.3 System Configuration 

For the purposes of this paper, we will assume there are two disjoint groups of 
processes in the system. The first forms the distributed database and will
be referred to as the transaction managers or TM group (TM = {TM_1, ..., TM_n}).
Sites in this group are responsible for executing transactions and for
triggering the atomic commitment protocol. By participants in the protocol, we
mean processes in this group. The second group, the commit server or CS group
(CS = {C_1, ..., C_n}), is a set of replicated processes devoted to performing
the AC protocol. We assume that in any two consecutive views, there is a process
that transits from the old view to the new one⁴.

³ Once a transaction concludes, all its read locks can be released without
compromising correctness independently of whether the transaction commits or
aborts.

2.4 Problem Definition 

A non-blocking AC protocol should satisfy: (1) NBAC-Uniform-Validity: a
transaction is (opt) committed only if all the participants voted yes; (2)
NBAC-Uniform-Agreement: no two participants decide differently; (3)
NBAC-Termination: if there is a time after which there is a majority view
sequence in the CS group that permanently contains at least one correct process,
then the protocol terminates; (4) NBAC-Non-Triviality: if all participants voted
yes, and there are no failures or false suspicions, then commit is decided.

3 A Low Latency Commit Algorithm 

3.1 Protocol Overview 

The AC protocol starts when a client requests to commit a transaction. The 
commit request arrives at a transaction manager, TM_i, which then starts the
protocol. The protocol involves several rounds of messages in two phases: 

First phase 

1. Upon delivering the commit request, TM_i multicasts a reliable prepare to
commit message to the TM group. This message contains the transaction 
identifier ( tid ) to be committed and the number of participants involved 
(the number of TMs contacted during the execution of the transaction). 

2. Upon delivering the prepare to commit message, each participant uni- 
multicasts its vote and the number of participants to the CS group. If a 
participant has not yet written the corresponding entries to its local log 
when the prepare to commit message arrives, it sends the log entry in addi- 
tion to its vote without waiting to write to the log. After the message has 
been sent, it then writes the log entry to its local disk. 

Second phase 

1. Upon opt-delivering a vote message, the processes of the commit server decide
who will act as proxy coordinator for the protocol based on the tid of the
transaction and the current view. Assume this site is C_i. The rest of the
processes in the CS group act as backups in case C_i fails. If a no vote is
opt-delivered, the transaction is aborted immediately and an abort message
is reliably multicast to the TM group. If all votes are yes, as soon as the last
vote is opt-delivered at C_i, C_i sends a reliable multicast with an opt-commit
message to the TM group.

⁴ It might seem a strong assumption for safety that at least one server must
survive between views. However, this assumption is no stronger than the usual
one that assumes that the log is never lost. The strength of any of the
assumptions depends on the probability of the corresponding catastrophic
failures.

2. Upon delivering an abort message, a participant aborts the transaction. Upon
delivering an opt-commit message, the participant changes the transaction
locks to opt mode.

3. If all votes are affirmative, when they have been uni-delivered at C_i, C_i
reliably multicasts a commit message to the TM group.

4. When a participant delivers a commit or abort message, it releases all locks
(both opt and non-opt) held by the transaction and returns the corresponding
transactions that were on hold to their normal state.

5. If all the votes are affirmative, the coordinator opt-commits the transaction 
before being excluded from the majority view (before being able to commit 
the transaction), and one or more votes do not reach the majority view, the 
transaction will be aborted by the new coordinator. 

This protocol reduces the latency of the non-blocking commit in several ways. 
First, at no point in time in the protocol must a site wait to write a log entry to 
the disk before reacting to a message. The CS group acts as stable storage for 
both the participants (sites at the TM which could not yet write their vote and 
other transaction information to disk when the prepare to commit vote arrives) 
and the CS group itself (the coordinator does not need to write an entry to the 
log before sending the opt-commit message). Second, the coordinator in the CS 
group provides an outcome without waiting for the vote messages to be uniform. 
This reduces the overhead of uniform multicast as it overlaps its cost with that 
of committing the transaction. 

3.2 The Protocol 

The protocol uses the CS group to run the atomic commitment. The processes 
in the TM group⁵ only act as participants and the CS group acts as coordinator.
We use two tables, trans_tab and vote_tab, to store information in main memory
about the state of a transaction and the decision of each participant regarding
a given transaction at each CS_i. We also use a number of functions to change
and access the values of the attributes in these tables. Trans_tab contains the
attributes tid (the transaction’s identifier), n_participants (number of
participants in that transaction, all sites in the TM group), timestamp (of the
first vote for timeout purposes), coordinator (id of the coordinator site in the
CS group; this attribute is initially set with the function store_trans and
updated with the function store_coordinator), and outcome (the state of the
transaction; initially it is undecided, the state can be changed to aborted,
opt-committed or committed by invoking the function store_outcome). Vote_tab
contains the attributes tid (the transaction’s identifier), participant_id (site
emitting the vote, which must be a site in the TM group), vote (the actual
vote), vote_status (initially optimistic, when set with the function
store_opt_vote, and later definitive, when set with the function
store_def_vote), and log (any log entry the participant may have sent with the
vote). There are additional functions to consult the attributes associated to
each tid in the trans_tab. These functions are denoted with the same name as the
attribute but starting with a capital letter (e.g., Timestamp). There are also
functions to consult the vote_tab: Log (to obtain the log sent by a
participant), N_opt_yes_votes (number of yes votes delivered optimistically for
a particular transaction), N_def_yes_votes (similarly for uni-delivered yes
votes). An additional function, Coordinator, is used to obtain the id of the
coordinator of a transaction given its tid and the current view.

⁵ For simplicity, messages are multicast to all TM processes. Processes for
which the message is not relevant just discard it.
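
One possible in-memory layout for the two tables and their accessor functions
is sketched below. The field types, the dictionary keyed by (tid,
participant_id), and in particular the rule used by `coordinator` (index the
sorted view by tid modulo the view size) are assumptions made for
illustration; the text only requires that the coordinator be a deterministic
function of the tid and the current view.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TransEntry:
    """One row of trans_tab."""
    tid: int
    n_participants: int
    timestamp: float
    coordinator: Optional[int] = None
    outcome: str = "undecided"  # or "aborted", "opt_committed", "committed"


@dataclass
class VoteEntry:
    """One row of vote_tab."""
    tid: int
    participant_id: int
    vote: str                   # "yes" or "no"
    vote_status: str = "optimistic"
    log: Optional[bytes] = None


VoteTab = Dict[Tuple[int, int], VoteEntry]


def store_opt_vote(vote_tab, tid, participant_id, vote, log=None):
    vote_tab[(tid, participant_id)] = VoteEntry(
        tid, participant_id, vote, "optimistic", log)


def store_def_vote(vote_tab, tid, participant_id):
    # Assumes the vote was opt-delivered (and stored) before being uni-delivered.
    vote_tab[(tid, participant_id)].vote_status = "definitive"


def n_opt_yes_votes(vote_tab, tid):
    return sum(1 for v in vote_tab.values()
               if v.tid == tid and v.vote == "yes")


def n_def_yes_votes(vote_tab, tid):
    return sum(1 for v in vote_tab.values()
               if v.tid == tid and v.vote == "yes"
               and v.vote_status == "definitive")


def coordinator(view, tid):
    """Hypothetical deterministic choice of coordinator from view and tid."""
    members = sorted(view)
    return members[tid % len(members)]
```

Because every CS process evaluates `coordinator` on the same view and tid, all
replicas agree on the coordinator without exchanging extra messages.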

TM Group actions:

TM.A  Upon delivering Prepare(tid):
    if prepared in advance then
        uni-multicast(CS, Vote(tid, n_participants, my_id, vote, empty))
    else
        uni-multicast(CS, Vote(tid, n_participants, my_id, vote, log_record))
    end if

TM.B  Upon delivering Opt-commit(tid):
    Change transaction tid locks to opt-mode

TM.C  Upon delivering Commit/Abort(tid):
    Commit/Abort the transaction and release transaction tid locks
    Change the corresponding on-hold transactions to normal status



CS Group actions:

CS.A  Upon opt-delivering Vote(tid, n_participants, participant_id, vote, log):
    store_opt_vote(vote_tab, tid, participant_id, vote, log)
    -- the transaction outcome is still undecided
    if Outcome(trans_tab, tid) = undecided then
        if vote = no then
            if Coordinator(current_view, tid) = my_id then
                rel_multicast(TM, Abort(tid))
            end if
            store_outcome(trans_tab, tid, aborted)
        else -- vote = yes
            if N_opt_yes_votes(vote_tab, tid) = 1 then -- it is the first vote
                timestamp ← current_time
                if Coordinator(current_view, tid) = my_id then
                    set_up_timer(tid, timestamp + waiting_time)
                end if
                store_trans(trans_tab, tid, n_participants, timestamp)
            end if
            if N_opt_yes_votes(vote_tab, tid) = N_participants(trans_tab, tid) then -- all voted yes
                store_outcome(trans_tab, tid, opt_committed)
                if Coordinator(current_view, tid) = my_id then
                    disable_timer(tid)
                    rel_multicast(TM, Opt-commit(tid))
                end if
            end if
        end if
    end if



CS.B  Upon uni-delivering Vote(tid, n_participants, participant_id, vote, log):
    store_def_vote(vote_tab, tid, participant_id)
    if (N_def_yes_votes(vote_tab, tid) = N_participants(trans_tab, tid))
       and (Outcome(trans_tab, tid) ≠ aborted) then
        store_outcome(trans_tab, tid, committed)
        if Coordinator(current_view, tid) = my_id then
            rel_multicast(TM, Commit(tid))
        end if
    end if



CS.C  Upon expiring Timer(tid):
    store_outcome(trans_tab, tid, aborted)
    uni-multicast(CS, Timeout(tid))

CS.D  Upon uni-delivering Timeout(tid):
    store_outcome(trans_tab, tid, aborted)
    if Coordinator(current_view, tid) = my_id then
        rel_multicast(TM, Abort(tid))
    end if

CS.E  Upon delivering ViewChange(v_i):
    current_view ← v_i
    -- State synchronization with new members
    if my_id is the lowest in v_i that belonged to v_(i-1) then
        for every Ci ∈ v_i do
            if Ci ∉ v_(i-1) then
                send(Ci, State(trans_tab, vote_tab))
            end if
        end for
    elsif my_id ∉ v_(i-1) then -- I am a new member
        receive(State(trans_tab, vote_tab))
    end if
    -- Assignment of new coordinators in v_i
    for each tid ∈ trans_tab do
        if Coordinator(v_i, tid) = my_id then
            if Outcome(trans_tab, tid) = committed then
                rel_multicast(TM, Commit(tid))
            elsif Outcome(trans_tab, tid) = aborted then
                rel_multicast(TM, Abort(tid))
            else
                set_up_timer(tid, Timestamp(trans_tab, tid) + waiting_time)
            end if
        end if
    end for
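
To make the interplay of actions CS.A and CS.B concrete, the following Python
sketch replays them at a single commit server. It is a simplified model under
stated assumptions, not the protocol implementation: timers, timeouts (CS.C,
CS.D) and view changes (CS.E) are omitted, messages that would be reliably
multicast to the TM group are merely appended to a list, and the coordinator
rule in `is_coordinator` is a hypothetical stand-in for the Coordinator
function.

```python
class CommitServer:
    """Vote handling of one CS process (sketch of CS.A and CS.B only)."""

    def __init__(self, my_id, view):
        self.my_id = my_id
        self.view = sorted(view)
        self.outcome = {}   # tid -> outcome; a missing tid means undecided
        self.opt_yes = {}   # tid -> participants whose yes vote was opt-delivered
        self.def_yes = {}   # tid -> participants whose yes vote was uni-delivered
        self.sent = []      # messages this server would rel-multicast to TM

    def is_coordinator(self, tid):
        # Hypothetical rule: any deterministic function of view and tid works.
        return self.view[tid % len(self.view)] == self.my_id

    def _announce(self, kind, tid):
        if self.is_coordinator(tid):
            self.sent.append((kind, tid))

    def on_opt_vote(self, tid, n_participants, participant, vote):
        """CS.A: react to an optimistically delivered vote."""
        if vote == "yes":
            self.opt_yes.setdefault(tid, set()).add(participant)
        if tid in self.outcome:
            return                      # outcome is no longer undecided
        if vote == "no":
            self._announce("abort", tid)
            self.outcome[tid] = "aborted"
        elif len(self.opt_yes[tid]) == n_participants:
            self.outcome[tid] = "opt_committed"
            self._announce("opt-commit", tid)

    def on_uni_vote(self, tid, n_participants, participant, vote):
        """CS.B: react to a uniformly delivered vote."""
        if vote == "yes":
            self.def_yes.setdefault(tid, set()).add(participant)
        if (len(self.def_yes.get(tid, ())) == n_participants
                and self.outcome.get(tid) != "aborted"):
            self.outcome[tid] = "committed"
            self._announce("commit", tid)
```

Every replica updates its outcome table identically, but only the process that
currently acts as coordinator for a tid emits messages to the TM group.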



Dealing with coordinator failures. Since sites in the TM group only act 
as participants, failures in the TM group do not affect the protocol. In the CS 
group, all processes are replicas of each other. Strong virtual synchrony ensures 
that any pending message sent in the previous view is delivered before delivering 
a new view. Thus, when a process fails (or it is falsely suspected), a new view is 
eventually delivered to a majority of available connected CS processes. Once the 
new view is available, a working CS process takes over as coordinator for all the 
on-going commitment protocols coordinated by the failed process (CS.E). For 
each on-going transaction commit, the new coordinator checks the delivery time 
of the first vote and sets up a timer accordingly (CS.E). The actions taken by 
the new coordinator at this point in time depend on the protocol stage. If the 
transaction outcome is already known (all the votes have been opt-delivered at all 
CS members or a no vote message has been opt-delivered), the new coordinator 
multicasts the outcome to the participants (CS.E). If the outcome is undecided 
(i.e., all previously delivered votes were affirmative and there are pending votes), 
the protocol proceeds as normal and the new coordinator waits until all vote 
messages (or a no vote) have been opt-delivered (CS.A). 

The problematic case is when the coordinator has decided to commit the 
transaction and then it is excluded from the view. It can be the case that the 
coordinator has had time to opt-commit the transaction, but not to commit it, 
and that there are missing votes in the majority view. In traditional 2PC, this 
situation is avoided by blocking. In our protocol, the blocking situation is avoided 
by the use of uniform multicast within the server group. The surviving sites can 
safely ignore the previous coordinator: due to uniformity, at worst, they will be
aborting an opt-committed transaction, which does not violate consistency. A
more accurate characterization of the rollback situation follows:

— All the votes are yes and have been opt-delivered at the coordinator. 

— The coordinator successfully multicasts the opt-commit to the participants. 

— The vote from at least one of the participants (e.g., p) is not uni-delivered. 

— The coordinator and p become inoperative (e.g., due to a crash, a false 
suspicion, etc.) in such a way that the multicast does not reach any other 
member of the new primary view. 

Although such a situation is possible, it is fair to say that it is extremely
unlikely. Even during instability periods where false suspicions are frequent and
many views are delivered, the odds are very low that a message is opt-delivered
only at the coordinator and that both the coordinator and p become inoperative
before the multicast of missing votes reaches any other CS member. Additionally,
this has to happen after the opt-commit is effective, as otherwise there would
not be any rollback. Since this situation is so rare, the number of one-level
aborts will be minimal even if the protocol is making optimistic decisions. It
is also possible to enhance the protocol by switching optimism off during
instability periods.



Bounding commit duration. To guarantee the liveness of the protocol and 
to prevent unbounded resource contention it is necessary to limit the duration of 
the commit phase of a transaction. This limitation is enforced by setting a timer 
at the coordinator when it receives the first vote from a transaction (CS.A). The 
rest of the members timestamp the transaction with the current time when they 
opt-deliver the first vote. If all participant votes have reached the coordinator 
before the timer expires, the timer is disabled (CS.A). Otherwise, the coordinator 
decides to abort the transaction but it does not immediately multicast the abort 
decision to the TM group (CS.C). Instead, it uni-multicasts a timeout message to 
the CS group. When this message is uni-delivered at the coordinator, a message 
is sent to the participants with the abort decision (CS.D). 

It could be the case that the coordinator multicasts a timeout message and,
before uni-delivering it, the missing votes are opt-delivered at the coordinator.
In that case the transaction will be aborted (its outcome is no longer undecided
when the vote is opt-delivered since in CS.C the outcome is set to aborted when
the timer expires). The rest of the CS members will also abort the transaction,
no matter the order in which those messages are delivered. If the missing votes
are delivered before the timeout message, the transaction outcome will be set
to opt-committed (CS.A) until the timeout message is uni-delivered (if it ever
is). Upon uni-delivery of the timeout message the outcome is changed to aborted
(CS.D). If the timeout message is uni-delivered before the last vote at a CS
member, the transaction outcome will be initially set to aborted (CS.D) and
remain so (CS.A).

The coordinator can be excluded from the majority view during this process 
and a new coordinator will take over. If the new coordinator has uni-delivered 
the timeout message, the outcome of the transaction will be abort (CS.E). This 
will happen independently of whether the old coordinator sent the abort
message to the TM group. If the new coordinator has not uni-delivered the timeout
message before installing the new view, the failed coordinator did not uni-deliver
it either (due to uniformity) and the new coordinator will not deliver that
message (strong virtual synchrony). Therefore, the new coordinator will behave as
a regular coordinator and will set the timer and wait for any pending vote (CS.E
to set the timer, and CS.A in the event a vote arrives).

Despite the majority view approach, the protocol would not terminate if all
the coordinators assigned to a transaction are excluded from the view before
deciding the outcome. For instance, the CS group can transit perpetually between
views {1, 2, 3, 4} and {1, 2, 3, 5}, with processes 4 and 5 being the
coordinators of a transaction t in each view. In this case, t will never commit.
This scenario can be avoided by choosing, whenever possible, a coordinator that
has not previously coordinated the transaction. If there is at least one correct
process, this will guarantee that the outcome of t will eventually be decided,
thereby ensuring the liveness of the algorithm.
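
A coordinator selection rule with this property might look as follows. The
function name, the `previous` set (assumed to be part of the replicated state
so that all CS processes compute the same answer), and the modulo rule are
assumptions for illustration only.

```python
def choose_coordinator(view, tid, previous):
    """Prefer members of `view` that have not yet coordinated transaction tid.

    view     -- iterable of process ids in the current view
    previous -- set of process ids that already coordinated tid (hypothetical
                replicated bookkeeping, identical at every CS process)
    """
    members = sorted(view)
    fresh = [p for p in members if p not in previous]
    candidates = fresh or members  # fall back if every member has had a turn
    return candidates[tid % len(candidates)]
```

In the scenario above, once processes 4 and 5 have both coordinated t, the
next view change hands t to one of the processes 1, 2 or 3, which belong to
both alternating views and can therefore decide the outcome.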



Maintaining consistency across partitions. Although partitions always 
lead to blocking, our protocol maintains consistency even when partitions oc- 
cur. That is, no replica decides differently on the outcome of a transaction even 
when the network partitions. Consistency is enforced by combining uniformity, 
strong virtual synchrony, and majority views. To see why, we will only consider 
partitions in the CS group. Partitions in the TM may lead to delays in the vote
delivery (which may result in a transaction abort) and to delays in the propagation
of the transaction outcome (thus resulting in blocking during the partition).
Partitions that leave the coordinator of a transaction in the majority partition 
of the CS group are not a problem, as the minority partition gets blocked (due 
to the majority view virtual synchrony). Since the transaction outcome is always 
decided after the uni-delivery of a message (either a vote or timeout message), 
uniformity guarantees that the decision will be taken by every process in the 
majority view. When the coordinator of a transaction is in a minority parti- 
tion, undecided transactions cannot create problems as the coordinator, once in 
the minority partition, will block. When this happens, a new coordinator can 
make any decision regarding undecided transactions without compromising con- 
sistency. Only transactions whose outcome has been decided by the coordinator 
during the partition may lead to inconsistencies. There are four cases to consider: 

— The coordinator optimistically commits a transaction when it opt-delivers 
the last vote (and all votes have been affirmative). Assume a partition leaves 
the coordinator in a minority partition. The new coordinator may (1) opt- 
deliver all the votes or (2) never deliver one or more votes. In the first case, 
it will opt-commit the transaction (CS.A) thereby agreeing with the old 
coordinator. In the second case, it will abort the transaction once the timer 
expires (CS.C). Since the transaction was only optimistically committed by 
the old coordinator, the new coordinator is free to decide to abort without 
violating consistency. 

— If the old coordinator committed a transaction, the new coordinator will 
do the same. A transaction is committed when all the votes have been
uni-delivered (CS.B). If the votes were uni-delivered at the old coordinator, all
the processes in that view also uni-delivered them in that view (uniformity 
and strong virtual synchrony). Thus, the new coordinator will also commit 
the transaction (CS.E). 

— The old coordinator opt-delivered a no vote and aborted the transaction. 
The new coordinator will either have delivered the no vote or timed out. In 
both cases the new coordinator will also abort the transaction (CS.A and 
CS.C, respectively) thereby agreeing with the old coordinator. 

— The old coordinator timed out and aborted the transaction. The transaction 
will not be effectively aborted until the timeout message is uni-delivered. 
Uniformity guarantees that the timeout message, if uni-delivered, will be 
uni-delivered to both the old and the new coordinator, thereby preventing 
any inconsistency. 



Replica recovery and partition merges. In order to maintain an appropriate
level of availability, it is necessary to enable new (or recovered) replicas to
join the CS group and to allow partitioned groups to merge again. When a new 
process joins the CS group, virtual synchrony guarantees that the new process 
will deliver all the messages delivered by the other replicas after installing the 
new view (and thus, after state synchronization). The installation of the new 
view will trigger state synchronization (CS.E). This involves sending a
State message with the vote_tab and trans_tab tables to the new member from an
old member of the group (one that transits from the previous view to the current
one; such a member is guaranteed to exist due to the majority view approach).
Members from a minority partition that join a majority view will be treated 
as recovered members, that is, they will be sent the up-to-date tables from a 
process belonging to the previous majority view. 

The state transfer and the assumption that at least one process from the
previous view transits to the next view guarantee that a new member acting as
coordinator will use up-to-date information, thereby ensuring the consistency of
the protocol. The recovery of a participant has not been included in the
algorithm due to its simplicity (upon recovery it will just ask the CS group about
the fate of some transactions).

4 Correctness 

Lemma 1 (NBAC-Uniform-Validity). A transaction (opt)commits only if
all the participants voted yes. □

Proof (lemma 1): (Opt)Commit is decided when the coordinator multicasts
such a message after the transaction has been recorded as (opt)committed in the
trans_tab (CS.A). This can only happen when all participant votes have been
(opt/uni)-delivered and they are yes votes. □



Lemma 2 (NBAC-Uniform-Agreement). No two CS members decide
differently. □

Proof (lemma 2): In the absence of failures the lemma holds trivially, as only
the coordinator decides about the outcome of the transaction. The rest of the
members just log the information about the transaction in case they have to take
over. The only way for two members to decide on the same transaction is for one
of them to be the coordinator of the transaction and to be excluded from the
majority view before deciding the outcome; a CS member then takes over as the
new coordinator and decides about the transaction.

Assume that a coordinator makes a decision and, due to its exclusion from
view V_i, a new coordinator takes over and makes a different decision. Let us
assume without loss of generality that the new coordinator takes over in view
V_{i+1} (in general, it will be in V_{i+f}). The old coordinator can have decided to:

1. Commit. The old coordinator can only decide commit if all votes have been
uni-delivered, and they all were affirmative. If the new coordinator decides
to abort, it can only be because it has not uni-delivered one or more votes,
either before the view change or before its timer expired. In this situation,
there are two cases to consider:

a. The new coordinator belonged to V_i. Hence, all votes were uni-delivered
in V_i at the old coordinator (which needed all yes votes to decide to
commit) but not at the new coordinator (otherwise it would also decide
to commit), which violates multicast uniformity.

b. The new coordinator joined the CS group after a recovery or a partition
merge. From the recovery procedure, the new coordinator has gotten
the most up-to-date state during the state transfer triggered by the view
change. If it decides to abort, it is because a process in view V_i (the one
which sent its tables in the state transfer) did not uni-deliver one or more
votes. This again violates uniformity and it is therefore impossible.

2. Abort due to a no vote. If the old coordinator aborts the transaction, it
does so as soon as the no vote is opt-delivered (CS.A). In order to decide to
commit, the new coordinator needs to uni-deliver all votes, and all of them
must be yes. Since a participant votes only once, this situation cannot occur
(for it to occur, a participant would need to say no to the old coordinator and
yes to the new one).

3. Abort due to a timeout. If the old coordinator decided to abort due to a
timeout, then it uni-delivered its own timeout message (CS.C). If the new
coordinator decides to commit, then it must have uni-delivered all the votes
before its timer expires and before uni-delivering the timeout message. This
implies that the timeout message has been delivered to the old coordinator in
view V_i and not to the new coordinator. Now there are two cases to consider:

a. If the new coordinator was in view V_i, the fact that it has not
uni-delivered the timeout message violates multicast uniformity. It is
therefore not possible for the new coordinator to have been in V_i.

b. If the new coordinator was not in view V_i then it has joined the group
in view V_{i+1}, and thus during the state synchronization it has received
the most up-to-date tables. However, this implies that some process in
the CS group was in view V_i, transited to V_{i+1}, but did not uni-deliver
the timeout message. Again this violates uniformity.

Since in all possible cases the new coordinator cannot make a
different decision than the old coordinator once the latter has made a decision
(abort or commit), the lemma is proven. □

Lemma 3 (NBAC-Termination). If there is a time after which there is a
majority view sequence in the CS group that permanently contains at least one
correct process, then the protocol terminates. □

Proof (lemma 3): Assume for contradiction that the protocol never ends. This
means that either:

— A correct coordinator never decides. A correct coordinator will either: (1) 
opt-deliver a no vote, in which case the transaction is aborted, or (2)
uni-deliver all the votes, all of them yes, in which case the transaction is
committed, or (3) uni-deliver the timeout message before opt-delivering all the
votes (all previous votes being affirmative), in which case the transaction is
aborted. Therefore, if there is a correct process, there will eventually be a
correct coordinator that will decide the transaction outcome and multicast 
it to the participants, thus terminating the protocol. 

— There is an infinite sequence of unsuccessful coordinators that do not termi- 
nate the protocol. The NewCoordinator function, whenever possible, chooses 
a fresh coordinator (a process that did not previously coordinate the trans- 
action). This means that a correct process p will eventually coordinate the 
transaction. Since p is correct and belongs to the majority view, it will even- 
tually terminate the protocol as shown before. 

Thereby, it is proven that the protocol eventually terminates. □ 
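
The three coordinator cases used in the proof above can be sketched as a function over the sequence of deliveries seen by one coordinator. The event encoding (`"opt_vote"`, `"uni_vote"`, `"timeout"` tuples) and the name `decide` are assumptions made for illustration; optimistic commit and the CS.A to CS.E actions of the actual protocol are not modelled.

```python
def decide(participants, events):
    """Return "commit", "abort", or None (still undecided).

    events is the ordered list of deliveries at the coordinator:
      ("opt_vote", participant, is_yes)
      ("uni_vote", participant, is_yes)
      ("timeout",)   # uni-delivery of the coordinator's timeout message
    """
    expected = set(participants)
    uni_yes = set()
    for event in events:
        kind = event[0]
        if kind == "timeout":
            return "abort"            # case (3): timeout before all votes
        _, who, is_yes = event
        if not is_yes:
            return "abort"            # case (1): a no vote was delivered
        if kind == "uni_vote":
            uni_yes.add(who)
            if uni_yes == expected:
                return "commit"       # case (2): all yes votes uni-delivered
    return None


all_yes = [("opt_vote", "a", True), ("uni_vote", "a", True),
           ("opt_vote", "b", True), ("uni_vote", "b", True)]
one_no = [("opt_vote", "a", True), ("opt_vote", "b", False)]
timed_out = [("uni_vote", "a", True), ("timeout",), ("uni_vote", "b", True)]
```

Every branch returns on the first decisive event, so a coordinator that keeps receiving deliveries cannot stay undecided forever once the timeout message is uni-delivered.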

Lemma 4 (NBAC-Non-Triviality). If all participants vote yes and there are no
failures or false suspicions, then commit will be decided. □

Proof (lemma 4): The abort decision can only be taken when the coordinator
receives a no vote, or because it times out. Otherwise, the decision is commit. □

Theorem 1 (NBAC-Correctness). The protocol presented in the paper fulfills
the non-blocking atomic commitment properties: NBAC-Validity,
NBAC-Uniform-Agreement, NBAC-Termination, and NBAC-Non-Triviality. □

Proof (theorem 1): It follows from lemmas 1, 2, 3 and 4. □

5 Conclusions 

Atomic commitment is an important feature in distributed transactional sys- 
tems. Many commercial products and research prototypes use it to guarantee 
transactional atomicity (and, with it, data consistency) across distributed ap- 
plications. The current standard protocol for atomic commitment is 2PC which 
offers reasonable performance but might block when certain failures occur. In 
this paper we have proposed a non-blocking atomic commitment protocol that
offers the same reasonable performance as 2PC. Unlike previous work in the
area, we have emphasized several practical aspects of
atomic commitment. First, the new protocol does not create any additional mes- 
sage overhead when compared with 2PC. Second, by using a replicated group 
as stable memory instead of having to flush log records to the disk, the protocol 
is likely to exhibit a shorter response time than standard 2PC. Third, the fact 
that the second round of the commit protocol is run only by a small subset of 
the participants minimizes the overall overhead. Fourth, and most relevant in 
practice, the new protocol can be implemented on top of the same interface as 
that used for 2PC. This is because, unlike most non-blocking protocols that have 
been previously proposed, the participants only need to understand a prepare to 
commit message (or vote-request) and then a commit or abort message. This is 
exactly the same interface required for 2PC and it is implemented in all transac- 
tional applications. Because of these features, we believe the protocol constitutes 
an important contribution to the design of distributed transactional systems. We 
are currently evaluating the protocol empirically to get performance measures 
and are looking into several possible implementations to further demonstrate the 
advantages it offers. 
Abstract. We introduce the notion of stable leader election and derive several
algorithms for this problem. Roughly speaking, a leader election algorithm is stable 
if it ensures that once a leader is elected, it remains the leader for as long as it does 
not crash and its links have been behaving well, irrespective of the behavior of other 
processes and links. In addition to being stable, our leader election algorithms have 
several desirable properties. In particular, they are all communication-efficient, i.e., 
they eventually use only n links to carry messages, and they are robust, i.e., they 
work in systems where only the links to/from some correct process are required 
to be eventually timely. Moreover, our best leader election algorithm tolerates 
message losses, and it ensures that a leader is elected in constant time when the 
system is stable. We conclude the paper by applying the above ideas to derive a
robust and efficient algorithm for the eventually perfect failure detector ◇P.



1 Introduction 

1.1 Motivation and Background

Failure detection is at the core of many fault-tolerant systems and algorithms, and the
study of failure detectors has been the subject of intensive research in recent years. In
particular, there is growing interest in developing failure detector implementations that
are efficient, timely, and accurate [VRMMH98, LAF99, CTA00, FRT00, LFA00b].

A failure detector of particular interest is Ω [CHT96]. At every process p, and at
each time t, the output of Ω at p is a single process, say q. We say that p trusts q to be
up at time t. Ω ensures that eventually all correct processes trust the same process and
that this process is correct.

Note that a failure detector Ω can also be thought of as a leader elector: the process
currently trusted to be up by p can be thought of as the current “leader” of p, and Ω
ensures that eventually all processes have the same leader.

An Ω leader election is useful in many settings in distributed systems. For example,
some algorithms use it to solve consensus in asynchronous systems with failures
[Lam98, MR00, LFA00a] (in fact, Ω is the weakest failure detector to solve consensus
[CHT96]). Electing a leader can also be useful to solve a set of tasks efficiently in distributed

J. Welch (Ed.): DISC 2001, LNCS 2180, pp. lOS lTTI 2001. 
environments [DHW98]. Even though Ω is strong enough to solve hard problems such
as consensus, we will see that it is weak enough to admit efficient implementations.

Our main goal here is to propose efficient algorithms for Ω in partially synchronous
systems with process crashes and message losses. To illustrate this problem, consider the
following simple implementation of Ω [Lam98, DPLL97]. Assume that processes may
crash, but all links are eventually timely, i.e., there is a time after which all messages
sent take at most δ time to be received. In this system, one can implement Ω as follows:

1. Every process p periodically sends an OK message to all, and maintains a set of
processes from which it received an OK recently.¹

2. The output of Ω at p is simply the smallest process currently in p’s set.
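
The two steps above can be sketched for a single process as follows. The class name `SimpleOmega`, the explicit `now` arguments, and the choice to have a process always trust itself are assumptions of this sketch; the text only fixes the rule "smallest process heard from recently".

```python
class SimpleOmega:
    """Naive leader elector: trust whoever sent an OK recently, pick the smallest."""

    def __init__(self, pid, recent_window):
        self.pid = pid
        self.recent_window = recent_window  # how old an OK may be and still count
        self.last_ok = {}                   # sender id -> time of last OK received

    def on_ok(self, sender, now):
        # Step 1: remember when each process was last heard from.
        self.last_ok[sender] = now

    def trusted(self, now):
        recent = {q for q, t in self.last_ok.items()
                  if now - t <= self.recent_window}
        return recent | {self.pid}          # assumption: a process trusts itself

    def leader(self, now):
        # Step 2: output the smallest process in the trusted set.
        return min(self.trusted(now))


p = SimpleOmega(pid=2, recent_window=3)
p.on_ok(1, now=0)
p.on_ok(3, now=0)
leader_early = p.leader(now=1)    # 1 was heard from recently
leader_late = p.leader(now=10)    # all OKs are stale, only p itself is trusted
p.on_ok(1, now=10)
leader_again = p.leader(now=11)   # 1 is back, so it takes over again
```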

Note that the set of processes that p builds in part (1) is eventually equal to the set
of all correct processes. Thus part (1) actually implements an eventually perfect failure
detector ◇P [CT96]. So, in the above algorithm, we implement Ω by first implementing
◇P and then outputting the smallest process in the set of processes trusted by ◇P.
This implementation of Ω has several drawbacks:

1. The system assumptions required by the algorithm are too strong. In fact, this
algorithm works only if all links are eventually timely. Intuitively, however, one should
be able to find a leader in systems where only the links to and from some correct
process are eventually timely. In other words, while this algorithm requires n²
eventually timely links, we would like an algorithm that works even if there are only n
eventually timely links (those to and from some correct process).

2. The algorithm is not communication-efficient. In this algorithm, every process sends
an OK message to all processes, forever. That is, all the n² links carry messages,
in both directions, forever. Intuitively, this is wasteful: once a correct process is
elected as a leader, it should be sufficient for it to periodically send OK messages
to all processes (to inform them that it is still alive and so they can keep it as their
leader), and all other processes can keep quiet. In other words, after an election
is over, no more than n links should carry messages (those links from the elected
leader to the other processes). All the other links should become quiescent. We say
that a leader election algorithm is communication-efficient if there is a time after
which it uses only n unidirectional links.

3. The election is not stable. In this algorithm, processes can demote their current leader
and elect a new leader for no real reason: even if the current leader has not crashed
and its links have been timely for an (arbitrarily) long time, the leader can still be
demoted at any moment by an extraneous event. To see this, suppose process 2 is
trusted forever by ◇P (because it is correct and all its links are timely) and that it
is the current leader (because it is the smallest process currently trusted by ◇P).
If ◇P starts trusting process 1 (this can occur if the links from process 1 become
timely), then 2 loses the leadership and 1 is elected. If later 1 is suspected again,
2 regains the leadership. So 2 loses the leadership each time 1 becomes “good”

¹ “Recent” means within Δ from the last OK received. If processes send OK every η then
Δ = δ + η. If δ and η are not known, p sets Δ by incrementing it for every mistake it makes
[CT96].
again, even though 2 keeps behaving well and remains trusted forever by all the
processes! This is a serious drawback, because leadership changes are quite costly 
to most applications that rely on a leader. Thus, we are seeking a stable leader 
election algorithm. Roughly speaking, such an algorithm ensures that once a leader 
is elected, it remains the leader for as long as it does not crash and its links have 
been behaving well, irrespective of the behavior of other processes or links. 
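
The instability in drawback 3 can be reproduced in a few lines. The sequence of trusted sets below is made up for illustration: process 2 is trusted throughout, while process 1 is alternately trusted and suspected.

```python
def leaders_over_time(trusted_sets):
    """Apply the 'smallest trusted process' rule to each snapshot."""
    return [min(s) if s else None for s in trusted_sets]


def count_leader_changes(leaders):
    """Number of consecutive snapshots in which the leader differs."""
    return sum(1 for a, b in zip(leaders, leaders[1:]) if a != b)


# Process 2 is always trusted; process 1 comes and goes.
trusted = [{2, 3}, {1, 2, 3}, {2, 3}, {1, 2, 3}]
leaders = leaders_over_time(trusted)
changes = count_leader_changes(leaders)
```

Every change of leader here is caused by process 1 alone, which is exactly the kind of extraneous event a stable algorithm is meant to ignore.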

Our main goal here is to give algorithms for Ω (i.e., leader election algorithms) that are
communication-efficient, stable, and that work in systems where only the links to and
from some correct process are required to be eventually timely, as explained above. In
addition, we want an algorithm for Ω that can elect a leader quickly when the system
“stabilizes”, i.e., it has a small election time.

We achieve our goal progressively. We first present an algorithm for Ω that is
communication-efficient. This algorithm is simple; however, it has the following
drawbacks: (a) it is not stable, (b) it assumes that messages are not lost and (c) its
worst-case election time is proportional to n (footnote 2) even when the system is
stable. We next modify this algorithm to achieve stability. Then we change it so that it
works despite message losses. Finally, we modify it to achieve constant election time
when the system “stabilizes”. It is worth noting that our algorithms are self-stabilizing.

We conclude the paper by using our techniques to give an algorithm for ◇P that is
both robust and efficient: in contrast to previous implementations of ◇P, our algorithm
works in systems where only n bidirectional links are required to be eventually timely,
and there is a time after which only n bidirectional links carry messages. This algorithm
for ◇P works despite message losses.

1.2 Related Work 

The simple implementation of Ω described above is mentioned in several works (e.g.,
[Lam98, DPLL97]). Such an implementation, however, requires strong system
assumptions, is not communication-efficient, and is not stable. Larrea et al. give an algorithm
for Ω that is communication-efficient, but it requires strong system assumptions and is
not stable [LFA00b]. An indirect way to implement Ω is to first implement an eventually
strong failure detector ◇S [CT96] and then transform it into Ω using the algorithms in
[Chu98]. But such implementations also have drawbacks. First, the known
implementations of ◇S are either not communication-efficient [CT96, ACT99, ACT00] or they
require strong system assumptions [LAF99, LFA00b]. Second, the Ω that we get this
way is not necessarily stable.

To the best of our knowledge, all prior implementations of ◇P require O(n²)
links to be eventually timely. Larrea et al. propose a communication-efficient
transformation of Ω to ◇P, but it requires all links to be eventually timely and it does not
tolerate message losses [Lar00].

1.3 Summary of Contributions 

The contributions of the paper are the following: 

² Actually, it is proportional to the maximum number of failures.
- We introduce the notion of stable leader election and describe the first leader election
algorithm that is simultaneously stable and communication-efficient and requires 
only n eventually timely bidirectional links. 

- We modify our algorithm to work with message losses, first when processes have 
approximately synchronized clocks, and then when clocks are drift-free or have 
bounded drift. 

- We show how to achieve constant election time during good system periods. 

- We give an algorithm for ◇P that is both robust and efficient: it works in systems
where only n bidirectional links are required to be eventually timely, and there is a 
time after which only n bidirectional links carry messages. 



1.4 Roadmap 

This paper is organized as follows. In Sect. 2 we describe our model. In Sect. 3 we
define the problem of stable and message-efficient Ω leader election. In Sect. 4, we give
a simple algorithm for Ω, and then modify it in Sect. 5 to make it stable. In Sect. 6 we
give algorithms for Ω that work despite message losses. Then, in Sect. 7 we explain
how to obtain Ω with a constant election time when the system stabilizes. In Sect. 8
we discuss view numbers. Finally, in Sect. 9 we give a new algorithm for ◇P that
guarantees that there is a time after which only n bidirectional links carry messages.

Because of space limitations in this extended abstract, we have omitted all technical
proofs. They can be found in [ADGFT01].

2 Informal Model 

We consider a distributed system with n > 2 processes Π = {0, ..., n − 1} that can
communicate with each other by sending messages through a set of links Λ. We assume
that the network is fully connected, i.e., Λ = Π × Π. The link from process p to
process q is denoted by p → q. The system is partially synchronous in that (1) links
are sometimes timely (good) and sometimes slow, (2) processes have drift-free clocks
(which may or may not be synchronized) and (3) there is an upper bound B on the time
a process takes to execute a step. For simplicity, we assume that B = 0, i.e., processes
execute a step instantaneously, but it is easy to modify our results for any B > 0. At each
step a process can (1) receive a message, (2) change its state and (3) send a message.
The value of a variable of a process at time t is the value of that variable after the process
takes a step at time t.

Processes and process failure patterns. Processes can fail by crashing, and crashes
are permanent. A process failure pattern is a function F_P that indicates, for each time t,
which processes have crashed by t. We say that process p is alive at time t if p ∉ F_P(t).
We say process p is correct if it is always alive.

Link behavior pattern. A link behavior pattern is a function F_L that determines, for
each time t, which links are good at t. The guarantees provided by links when they are
good are specified by axiomatic link properties. These properties, given below, depend
on whether the link is reliable or lossy.

Reliable links. Some of our basic algorithms require reliable links. Such links do
not create, duplicate or drop messages. The link may sometimes be good and sometimes
slow. If a process sends a message through a link and the link remains good for δ (δ is
a system parameter known by processes) then the recipient gets the message within δ.
More precisely, a reliable link p → q ∈ Λ satisfies the following properties:

- (No creation or duplication): Process q receives a message m from p at most once,
and only if p previously sent m to q.

- (No late delivery in good periods): If p sends m to q by time t − δ and p → q is
good during times [t − δ, t] then q does not receive m after time t.

- (No loss): If p sends m to q then q eventually receives m from p.³

Lossy links. Like reliable links, lossy links do not create or duplicate messages and
may be slow or not. However, unlike reliable links, they may drop messages when they
are not good. A lossy link p → q ∈ Λ satisfies the following properties:

- (No creation or duplication): Same as above. 

- (No late delivery in good periods): Same as above. 

- (No loss in good periods): If p sends m to q at time t − δ and p → q is good during
times [t − δ, t] then q eventually receives m from p.

Connectivity. In this paper, we focus on implementing Ω (defined below). It is easy
to show that this is impossible if links are never good. We thus assume that there exists at
least one process whose links are eventually good. More precisely, we say that a process
p is accessible at time t if it is alive at time t and all links to and from p are good at time
t.⁴ We say that p is eventually accessible if there exists a time t such that p is accessible
at every time after t.⁵ We assume that there exists at least one process that is eventually
accessible.

3 Stable Leader Election 

3.1 Specification of Ω

We consider a weak form of leader election, denoted Ω, in which each process p has
a variable leader_p that holds the identity of a process or ⊥.⁶ Intuitively, eventually all
alive processes should hold the identity of the same process, and that process should be
correct. More precisely, we require the following property:

- There exists a correct process ℓ and a time after which, for every alive process p,
leader_p = ℓ.

³ For convenience, in our model dead processes “receive” messages that are sent to them (but of
course they cannot process such messages).

⁴ For convenience, we assume that a process is not accessible at times t < 0.

⁵ Note that eventually-forever accessible would be a more precise name for this property, but it
is rather cumbersome.

⁶ The original definition of Ω does not allow the output to be ⊥. We allow it here because it is
convenient for processes to know when the leader elector has not yet selected a leader.

If at time t, leader_p contains the same process ℓ for all alive processes p, then we say
that ℓ is leader at time t. Note that a process p never knows whether leader_p is really the
leader at time t. A process only knows that eventually leader_p is leader. This guarantee
seems rather weak, but it is actually quite powerful: it can be used to solve consensus in
asynchronous systems [CHT96].

3.2 Communication-Efficiency 

An algorithm for Ω is communication-efficient if there is a time after which it uses only
n unidirectional links. All our Ω algorithms are communication-efficient. Actually, if we
discount messages from a process to itself, our algorithms use only n − 1 links, which
is optimal [LFA00b].

3.3 Stability 

A change of leadership often incurs overhead to an application, which must react to deal
with the new leader. Thus, we would like to avoid switching leaders as much as possible,
unless there is a good reason to do so. For instance, if the leader has died or has been
inaccessible to processes, it must be replaced. An algorithm that changes leader only
in those circumstances is called stable. More precisely, a k-stable algorithm guarantees
that in every run,

- if p is leader at time t and p is accessible during times [t − kδ, t + 1] then p is leader
at time t + 1.
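
The k-stability condition can be checked mechanically on a recorded run. The sketch below assumes discrete integer time, a list `leader_at` giving the common leader at each time (or `None` when processes disagree), and a caller-supplied predicate `accessible(p, t)`; these representations are assumptions for illustration, not part of the definition.

```python
def is_k_stable(leader_at, accessible, k, delta):
    """Check the k-stability condition on a finite, integer-time trace.

    leader_at[t] is the leader at time t, or None if there is none.
    accessible(p, t) tells whether process p is accessible at time t.
    Times t < 0 are treated as not accessible, as in the model.
    """
    for t in range(len(leader_at) - 1):
        p = leader_at[t]
        if p is None:
            continue
        window = range(t - k * delta, t + 2)   # integer times in [t - k*delta, t + 1]
        if all(u >= 0 and accessible(p, u) for u in window):
            if leader_at[t + 1] != p:
                return False                   # an accessible leader was demoted
    return True


trace = [0, 0, 0, 0, 0, 1]                     # process 0 is replaced at time 5
always = lambda p, t: True
until_4 = lambda p, t: t <= 4                  # 0 stops being accessible at time 5
```

On this trace the demotion at time 5 violates 3-stability if process 0 was accessible throughout, but not if it became inaccessible at time 5.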

Here, k is a parameter that depends on the algorithm; the smaller the k, the better the
algorithm because it provides a stronger stability property. We introduced parameter k
because no algorithm can be “instantaneously” stable (0-stable) and 1-stable algorithms
have serious drawbacks, as we show in the full paper [ADGFT01]. Our algorithm for
reliable links is 3-stable while our best algorithm for lossy links is 6-stable.

4 Basic Algorithm for Ω

Figure 1 shows an algorithm for Ω that works in systems with reliable links. This
algorithm is simple and communication-efficient but not stable; we will later modify
it to get stability. Intuitively, processes execute in rounds r = 0, 1, 2, ..., where variable
r keeps the process’s current round. To start a round k, a process (1) sends (START, k)
to a specially designated process, called the “leader of round k”; this is just process
k mod n, (2) sets r to k, (3) sets the output of Ω to k mod n and (4) starts a timer,
a variable that is automatically incremented at each clock tick. While in round r, the
process checks if it is the leader of that round (task 0) and if so sends (OK, r) to all every
δ time.⁷ When a process receives an (OK, k) for the current round (r = k), the process

⁷ In this and other algorithms, we chose the sending period to be equal to the network delay δ. This
arbitrary choice was made for simplicity of presentation only. In general, the sending period
can be set to any value η, though, in this case, one needs to modify the algorithms slightly, e.g.,
by adjusting the timeout periods. The choice of η affects the quality of service of the failure
detector [CTA00], such as how fast a leader is demoted if it crashes.

Code for each process p:

 1  procedure StartRound(s)                   { executed upon start of a new round }
 2    if p ≠ s mod n then send (START, s) to s mod n      { wake up new leader }
 3    r ← s                                   { update current round }
 4    leader ← s mod n                        { output of Ω }
 5    restart timer

 6  on initialization:
 7    StartRound(0)
 8    start tasks 0 and 1

 9  task 0:                                   { leader sends OK every δ time }
10    loop forever
11      if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

12  task 1:
13    upon receive (OK, k) with k = r do      { current leader is active }
14      restart timer
15    upon timer > 2δ do                      { timeout on current leader }
16      StartRound(r + 1)                     { start next round }
17    upon receive (OK, k) or (START, k) with k > r do
18      StartRound(k)                         { jump to round k }

Fig. 1. Basic algorithm for Ω with reliable links.



restarts its timer. If the process does not receive (OK, r) for more than 2δ time, it times
out on round r and starts round r + 1. If a process receives (OK, k) or (START, k)
from a higher round (k > r), the process starts that round.

Intuitively, this algorithm works because it guarantees that (1) if the leader of the 
current round crashes then the process starts a new round and (2) processes eventually 
reach a round whose leader is a correct process that sends timely ( OK, fc) messages. 
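The round logic of Fig. 1 can be sketched as a small event-driven class. This is only an illustration under stated assumptions: the names BasicOmega, outbox, on_timeout and on_message are hypothetical, the timer and the periodic OK sender of task 0 are left to the caller, and messages are merely queued rather than sent.

```python
from dataclasses import dataclass, field


@dataclass
class BasicOmega:
    """Sketch of one process p running the round logic of the basic algorithm."""
    p: int                      # identifier of this process
    n: int                      # number of processes
    r: int = 0                  # current round
    leader: int = 0             # current output (leader of round r)
    outbox: list = field(default_factory=list)  # queued (destination, message) pairs
    timer_restarts: int = 0     # stands in for "restart timer"

    def __post_init__(self):
        self.start_round(0)     # on initialization: StartRound(0)

    def start_round(self, s):
        if self.p != s % self.n:
            # wake up the leader of the new round
            self.outbox.append((s % self.n, ("START", s)))
        self.r = s
        self.leader = s % self.n
        self.timer_restarts += 1

    def on_timeout(self):
        # caller invokes this when no (OK, r) arrived for more than 2*delta
        self.start_round(self.r + 1)

    def on_message(self, kind, k):
        if kind == "OK" and k == self.r:
            self.timer_restarts += 1          # current leader is active
        elif kind in ("OK", "START") and k > self.r:
            self.start_round(k)               # jump to the higher round
```

A process with p = 1 in a system of n = 3 starts in round 0 with leader 0; a timeout moves it to round 1, where it is itself the leader and therefore sends no START message.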

Theorem 1. Consider a system with reliable links. Assume some process is eventually
accessible. Figure 1 is a communication-efficient algorithm for Ω.

5 Stable Algorithm for Ω

The algorithm of Fig. 1 implements Ω but it is not stable because it is possible that (1)
some process q is accessible for an arbitrarily long time, (2) all alive processes have q
as their leader at time t, but (3) q is demoted at time t + 1. This could happen in two
essentially different ways:

Problem scenario 1. Initially all processes are in round 0 and so process 0 is the leader.
All links are good (timely), except the links to and from process 2, which are very slow.
Then at time 2δ + 1, process 2 times out on round 0 and starts round 1, and so 0 loses
leadership. Moreover, 2 sends (START, 1) to process 1. At time 2δ + 2, process 2
crashes and process 0 becomes accessible. Message (START, 1) is delayed until some
arbitrarily large time M ≫ 2δ + 2. At time M, process 1 receives (START, 1) and
starts round 1, and thus process 0 is no longer leader.

In this scenario, process 0 becomes the leader right after process 2 crashes, since
all alive processes are then in round 0 and hence have 0 as their leader. Unfortunately,
0 is demoted at time M, even though it has been accessible to all processes during the
arbitrarily long period [2δ + 2, M].

Problem scenario 2. We divide this scenario in two stages. (A) Setup stage. Initially
process 1 times out on round 0, starts round 1 and sends (OK, 1) to all. All processes
except process 0 get this message and start round 1. Then process 3 times out on rounds
1 and 2, starts round 3 and sends (OK, 3) to all. All processes except process 0 get
that message and start round 3; process 0, however, remains in round 0 (because all
messages from higher rounds are delayed). Then process 2 becomes accessible. All
processes except 0 remain in round 3 for a long time, while 0 remains in round 0 for a
long time. All processes except 0 then progressively time out on rounds 3, 4, ... until
they start round n + 2, say at time t. Meanwhile, process 0 receives the old (OK, 1)
message, advances to round 1, times out on round 1 and starts round 2 at time t. (B)
Demote stage. Note that at time t, process 2 is the leader because all processes are in
a round congruent to 2 modulo n. Moreover, 2 has been accessible for a long time.
Unfortunately, process 2 stops being the leader when process 0 receives (OK, 3) and
starts round 3.

Summary of bad scenarios. Essentially, scenario 1 is problematic because a single 
process may (1) time out on the current round, (2) send a message to move to a higher 
round and then (3) die. This message may be delayed and may demote the leader long in 
the future. On the other hand, scenario 2 is problematic because a process may become 
a leader while processes are in different rounds; after the leader is elected, a process in 
a lower round may switch its leader by moving to a higher round. 

Our new algorithm, shown in Fig. 2, avoids the above problems. To prevent problem
scenario 1, when a process times out on round k, it sends a (STOP, k) message to
k mod n before starting the next round. When k mod n receives such a message, it
abandons round k and starts round k + 1. To see why this prevents scenario 1, note that,
before process 2 sends (START, 1) to 1, it sends (STOP, 0) to 0. Soon after process 0
becomes accessible, it receives such a message and abandons round 0.

To avoid problem scenario 2, when a process starts round k, it no longer sets leader
to k mod n. Instead, it sets it to ⊥ and waits until it receives two (OK, k) messages
from k mod n. Only then does it set leader to k mod n. This guarantees that if k mod n is
accessible and some process sets leader to k mod n, then all processes have received at
least one (OK, k) and hence have started round k. In this way, all processes are in the
same round k.
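The two fixes just described (the STOP message and the delayed election after two OK messages) can be sketched as follows. The class and attribute names are hypothetical, timers and actual message transport are left to the caller, and one interpretation is assumed: an (OK, k) that makes a process jump to round k is counted as the first of the two (OK, k) messages.

```python
from collections import Counter


class StableOmega:
    """Sketch of the round logic with STOP messages and delayed election."""

    def __init__(self, p, n):
        self.p, self.n = p, n
        self.outbox = []            # queued (destination, message) pairs
        self.ok_count = Counter()   # number of (OK, k) messages received per round k
        self.start_round(0)

    def start_round(self, s):
        if self.p != s % self.n:
            # wake up the new leader candidate
            self.outbox.append((s % self.n, ("START", s)))
        self.r = s
        self.leader = None          # None plays the role of "no leader yet"

    def on_timeout(self):
        # tell the current candidate to step down, then move on
        self.outbox.append((self.r % self.n, ("STOP", self.r)))
        self.start_round(self.r + 1)

    def on_message(self, kind, k):
        if kind == "OK":
            self.ok_count[k] += 1
        if kind == "STOP" and k >= self.r:
            self.start_round(k + 1)             # abandon round k
        elif kind in ("OK", "START") and k > self.r:
            self.start_round(k)                 # jump to the higher round
        elif kind == "OK" and k == self.r:
            if self.leader is None and self.ok_count[k] >= 2:
                self.leader = k % self.n        # elect only after two OKs
```

With this sketch a process never outputs a leader for round k before it has seen two OK messages of that round, which is the property the argument above relies on.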



Theorem 2. Consider a system with reliable links. Assume some process is eventually
accessible. Figure 2 is a 3-stable communication-efficient algorithm for Ω.
Code for each process p:

 1  procedure StartRound(s)                           { executed upon start of a new round }
 2    if p ≠ s mod n
 3      then send (START, s) to s mod n               { wake up the new leader candidate }
 4    r ← s                                           { update current round }
 5    leader ← ⊥                                      { demote previous leader but do not elect leader quite yet }
 6    restart timer

 7  on initialization:
 8    StartRound(0)
 9    start tasks 0 and 1

10  task 0:                                           { leader/candidate sends OK every δ time }
11    loop forever
12      if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

13  task 1:
14    upon receive (OK, k) with k = r do              { current leader/candidate is active }
15      if leader = ⊥ and received at least two (OK, k) messages
16        then leader ← k mod n                       { now elect leader }
17      restart timer
18    upon timer > 2δ do                              { timeout on current leader/candidate }
19      send (STOP, r) to r mod n                     { stop current leader/candidate }
20      StartRound(r + 1)                             { start next round }
21    upon receive (STOP, k) with k ≥ r do            { current leader abdicates leadership }
22      StartRound(k + 1)                             { start next round }
23    upon receive (OK, k) or (START, k) with k > r do
24      StartRound(k)                                 { jump to round k }

Fig. 2. 3-stable algorithm for Ω with reliable links.



6 Stable Ω with Message Losses

In our previous algorithm, we assumed that links do not lose messages, but in many 
systems this is not the case. We now modify our algorithms to deal with message losses. 
First note that if all messages can be lost, there is not much we can do, so we assume 
there is at least one eventually accessible process p. That means that p is correct and 
there is a time after which the links to and from p do not drop messages and are timely 
(see Sect. □). 

6.1 Expiring Links 

So far, our model allowed links to deliver messages that have been sent long in the past. 
This behavior is undesirable because an out-of-date message can demote a leader that 
has been good recently. To solve this problem, we now use links that discard messages 
older than δ; we call them expiring links. Such links can be easily implemented from
“plain” lossy links if processes have approximately synchronized, drift-free clocks, by
using timestamps to expire old messages. We now make these ideas more precise.

Informal definition. Expiring links are lossy links that automatically drop old messages.
To model such links, we change the property “No late delivery in good periods” of lossy
links (Sect. □) to require no late delivery at all times, not just when the link is good.
More precisely, an expiring link p → q satisfies the following properties:

- (No late delivery): If p sends m to q by time t − δ then q does not receive m after t.

- (No creation or duplication): (same as before) Process q receives a message m from
p at most once, and only if p previously sent m to q.

- (No loss in good periods): (same as before) If p sends m to q at time t − δ and
p → q is good during times [t − δ, t] then q eventually receives m from p.

Implementation. If processes have perfectly synchronized clocks, we can easily imple-
ment expiring links from plain lossy links as follows: (1) the sender timestamps m before
sending it and (2) the receiver checks the timestamp and discards messages older than δ.
This idea also works when processes have ε-synchronized, drift-free clocks, though the
resulting link will have a δ parameter that is 2ε larger than the δ of the original links. It
is also possible to expire messages even if clocks are not synchronized, provided they are
drift-free or have a bounded drift, as we show in the full paper [ADGFT01].
Henceforth, we assume that all links are expiring links (this holds for all our Ω algorithms
that tolerate message losses).
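For the perfectly synchronized case, the timestamp-based implementation can be sketched in a few lines. The names Stamped, stamp and accept are hypothetical, clock readings are passed in explicitly as numbers, and the expiry bound delta is chosen by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stamped:
    payload: object   # the original message m
    sent_at: float    # sender's clock reading when m was sent


def stamp(payload, now):
    """Sender side: attach the current clock reading to the message."""
    return Stamped(payload, now)


def accept(msg, now, delta):
    """Receiver side: return the payload, or None if the message is older than delta."""
    if now - msg.sent_at > delta:
        return None   # expired: behave as if the link had dropped it
    return msg.payload
```

A message stamped at time 10.0 is accepted at time 10.5 when delta is 1.0, and dropped at time 11.5.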



6.2 O(n)-Stable Ω

Figure 3 shows an (n + 4)-stable algorithm for Ω that works despite message losses.
The algorithm is similar to our previous algorithm that assumes reliable links (Fig. 2),
with only three differences: the first one is that in line 2 p sends the START message
to all processes, not just to s mod n. The second difference is that there are no STOP
messages. And the last difference is the addition of lines 21 and 22; without these two
lines, the algorithm would not implement Ω (it is not hard to construct a scenario in
which the algorithm fails).

This new algorithm is O(n)-stable rather than O(1)-stable. To see why, consider the
following scenario. Initially all processes are in round 0. At time 2δ + 1 the following
happens: (1) process 2 times out on round 0 and attempts to send (START, 1) to all, but
crashes during the send and only sends to process 3 and (2) process 0 becomes accessible.
At time 3δ + 1, process 3 receives (START, 1) and tries to send (START, 1) to all, but
crashes and only sends to process 4. And so on. Then at time nδ + 1, process 0 is the
leader but it receives (START, 1) from process n − 1 and demotes itself, even though
it has been accessible during [2δ + 1, nδ + 1].

Theorem 3. Consider a system with message losses (expiring links). Assume some pro-
cess is eventually accessible. Figure 3 is an (n + 4)-stable communication-efficient
algorithm for Ω.
Code for each process p:

 1  procedure StartRound(s)                           { executed upon start of a new round }
 2    if p ≠ s mod n then send (START, s) to all      { bring all to new round }
 3    r ← s                                           { update current round }
 4    leader ← ⊥                                      { demote previous leader but do not elect leader quite yet }
 5    restart timer

 6  on initialization:
 7    StartRound(0)
 8    start tasks 0 and 1

 9  task 0:                                           { leader/candidate sends OK every δ time }
10    loop forever
11      if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

12  task 1:
13    upon receive (OK, k) with k = r do              { current leader/candidate is active }
14      if leader = ⊥ and received at least two (OK, k) messages
15        then leader ← k mod n                       { now elect leader }
16      restart timer
17    upon timer > 2δ do                              { timeout on current leader/candidate }
18      StartRound(r + 1)                             { start next round }
19    upon receive (OK, k) or (START, k) with k > r do
20      StartRound(k)                                 { jump to round k }
21    upon receive (OK, k) or (START, k) from q with k < r do
22      send (START, r) to q                          { update process in old round }

Fig. 3. (n + 4)-stable algorithm for Ω that tolerates message losses.



6.3 O(1)-Stable Ω

Our previous algorithm can tolerate message losses but it is only O(n)-stable. This can
be troublesome if the number n of processes is large. We now provide a better algorithm
that is 6-stable. We manage to get O(1)-stability by ensuring that a leader is not elected
if there are long chains of messages that can demote the leader in the future.

Our algorithm is shown in Fig. 4. It is identical to our previous algorithm, except
that there is a new (ALERT, k) message. This message is sent to all when a process
starts round k. When a process receives such a message from a higher round, the process
temporarily sets its leader variable to ⊥ for 6δ time units. However, unlike with a START
message, the process does not advance to the higher round.
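The election guard that results from the ALERT mechanism can be written as a single predicate. This is a sketch under assumptions: may_elect and its parameters are hypothetical names, the caller records the local receipt times of ALERT messages from higher rounds, and None stands for ⊥.

```python
def may_elect(leader, ok_count, higher_alert_times, now, delta):
    """Return True if a process may set its leader for the current round.

    leader             -- current output, None when no leader is set
    ok_count           -- number of (OK, k) messages received for the current round k
    higher_alert_times -- receipt times of (ALERT, k') messages with k' > k
    now, delta         -- current local time and the link delay bound
    """
    recent_alert = any(now - t <= 6 * delta for t in higher_alert_times)
    return leader is None and ok_count >= 2 and not recent_alert
```

An ALERT from a higher round thus blocks the election for 6δ time units without moving the process to that round.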

Theorem 4. Consider a system with message losses (expiring links). Assume some pro-
cess is eventually accessible. Figure 4 is a 6-stable communication-efficient algorithm
for Ω.
Code for each process p:

 1  procedure StartRound(s)                           { executed upon start of a new round }
 2    send (ALERT, s) to all
 3    if p ≠ s mod n then send (START, s) to all      { bring all to new round }
 4    r ← s                                           { update current round }
 5    leader ← ⊥                                      { demote previous leader but do not elect leader quite yet }
 6    restart timer

 7  on initialization:
 8    StartRound(0)
 9    start tasks 0 and 1

10  task 0:                                           { leader/candidate sends OK every δ time }
11    loop forever
12      if p = r mod n and have not sent (OK, r) within δ then send (OK, r) to all

13  task 1:
14    upon receive (OK, k) with k = r do              { current leader/candidate is active }
15      if leader = ⊥ and received at least two (OK, k) messages
16        and did not receive (ALERT, k') with k' > k within 6δ
17        then leader ← k mod n                       { now elect leader }
18      restart timer
19    upon timer > 2δ do                              { timeout on current leader/candidate }
20      StartRound(r + 1)                             { start next round }
21    upon receive (OK, k) or (START, k) with k > r do
22      StartRound(k)                                 { jump to round k }
23    upon receive (OK, k) or (START, k) from q with k < r do
24      send (START, r) to q                          { update process in old round }
25    upon receive (ALERT, k) with k > r do
26      leader ← ⊥                                    { suspend current leader }

Fig. 4. 6-stable algorithm for Ω that tolerates message losses.



7 Stable Ω with Constant Election Time

In some applications, it is important to have a small election time, that is, the time to elect
a new leader when the system is leaderless. This time is inevitably large if there are
crashes or slow links during the election. For instance, if an about-to-be leader crashes
right before being elected, the election has to start over and the system will continue to
be leaderless. Slow links often cause the same effect.

It is possible, however, to ensure small election time during good periods (periods
with no slow links or additional crashes). In such periods, the election time of our previous
algorithms is proportional to f, the number of crashes so far. This is because processes
may go through f rounds trying to elect processes who are long dead. With a simple
modification, however, it is possible to do much better and achieve a constant election
20    send (ALERT, r + 1) to all
20.1  send (PING, r) to all                           { ask who is alive }
20.2  wait for 2δ time or until receive (OK, k) or (START, k) with k > r
20.3  if received (OK, k) or (START, k) with k > r then return
20.4  S ← { q : received (PONG, r) from q }           { responsive processes }
      { we assume that p responds to itself immediately, so S is never ∅ }
20.5  k ← smallest k' > r such that k' mod n ∈ S      { round of next responsive process }
20.6  StartRound(k)                                   { start next round }

20.7  upon receive (PING, k) from q do
20.8    send (PONG, k) to q

Fig. 5. Improving the election time in the algorithm of Fig. 4.



time (independent of f). The basic idea is that, when a process wants to start a new
round, it first queries all processes to determine who is alive. Then, instead of starting
the next round, the process skips all rounds of unresponsive processes. Using this idea,
we can get a 6-stable algorithm with an election time of 9δ, as follows: we take the
algorithm of Fig. 4 and replace its line 20 with the code shown in Fig. 5.

Theorem 5. Consider a system with message losses (expiring links). Assume some pro-
cess is eventually accessible. If we replace line 20 in Fig. 4 with the code in Fig. 5, we
obtain a 6-stable communication-efficient algorithm for Ω. Its election time is 9δ when
there are no slow links or additional crashes.
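The round-skipping step (line 20.5 of Fig. 5) amounts to a short search. The sketch below assumes a hypothetical helper next_round whose argument responsive is the set S of process identifiers that answered the PING; the PING/PONG exchange itself is not modeled.

```python
def next_round(r, n, responsive):
    """Smallest round k > r whose leader (k mod n) is in the responsive set."""
    if not responsive or not all(0 <= q < n for q in responsive):
        raise ValueError("responsive must be a non-empty set of ids in 0..n-1")
    k = r + 1
    while k % n not in responsive:
        k += 1          # skip rounds led by unresponsive processes
    return k
```

Since the set is non-empty and contains only valid identifiers, the loop inspects at most n candidate rounds. For example, with n = 5 and only processes 0 and 3 responsive, a process in round 0 goes directly to round 3.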



8 Leader Election with View Numbers 

It may be useful to tag leaders with a view number such that there is at most one leader
per view number and eventually processes agree on the view number of the leader. More
precisely, we define a variant of Ω, which we call Ω+, in which each process outputs
a pair (p, v) or ⊥, where p is a process and v is a number. Ω+ guarantees that (1) if
some process outputs (p, v) and some process outputs (q, v) then p = q and (2) there
exists a correct process ℓ, a number v_ℓ and a time after which, for every alive process p,
p outputs (ℓ, v_ℓ).

It turns out that our Ω algorithms can be made to output a view number with no
modifications: they can simply output the current round r. By doing that, it is not hard
to verify that our algorithms actually implement Ω+.



9 An Efficient Algorithm for ◇P

Recall that, at each alive process p, the eventually perfect failure detector ◇P outputs
a set of trusted processes, such that there is a time after which the set of p contains

* The same idea can be applied to get a constant election time with our other algorithms.
Code for each process p:

 1  procedure StartRound(s)                           { executed upon start of a new round }
 2    if p ≠ s mod n then send (START, s) to all
 3    r ← s                                           { update current round }
 4    leader ← s mod n
 5    trust ← Π                                       { trust all initially }
 6    restart timer

 7  on initialization:
 8    StartRound(0)
 9    start tasks 0 and 1

10  task 0:                                           { leader updates trust and sends OK every δ time }
11    loop forever
12      if p = r mod n and have not sent (OK, r, *) within δ then
13        if p has been in round r for at least 2δ time then
14          foreach q ∈ trust s.t. p did not receive (ACK, r) from q in the last 2δ time do trust ← trust \ {q}
15        send (OK, r, trust) to all

16  task 1:
17    upon receive (OK, k, tr) with k = r do          { current leader is active }
18      if p ∉ tr then StartRound(r + 1)              { leader does not trust p, so p starts new round }
19      else
20        trust ← tr
21        send (ACK, r) to r mod n
22        restart timer
23    upon timer > 2δ do                              { timeout on current leader }
24      StartRound(r + 1)                             { start next round }
25    upon receive (OK, k, tr) or (START, k) with k > r do
26      if received (OK, *, *) then send (ACK, k) to k mod n
27      StartRound(k)                                 { jump to round k }
28    upon receive (OK, k, tr) or (START, k) from q with k < r do
29      send (START, r) to q

Fig. 6. An efficient algorithm for ◇P.



process q if and only if q is correct. We now give an algorithm for ◇P that is both robust
and efficient: In contrast to previous implementations of ◇P, our algorithm works in
systems where only n bidirectional links are required to be eventually timely, and there
is a time after which only n bidirectional links carry messages.

Our algorithm, shown in Fig. 6, tolerates message losses with expiring links. It is
based on the algorithm in Fig. 4, and the difference is that (1) there are no ALERT
messages and (2) there is a mechanism to get the list trust of trusted processes of ◇P:
When processes receive OK, they send ACK to the leader, and the leader sets trust to the
set of processes from which it received ACK recently. The leader then sends its trust to
other processes, by piggybacking it in the OK messages. Upon receiving OK, a process q
checks if the leader’s trust contains q. If so, the process sets its own trust to the leader’s.
Else, the process notices that the leader has made a mistake, and so it starts the next
round.

We assume that if a process sends a message to itself, that message is received and
processed immediately.
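The trust mechanism has two sides, which the following sketch separates. The function names are hypothetical, last_ack is assumed to map each process to the local time of its most recent ACK for the current round, and the leader is assumed to appear in last_ack because it acknowledges its own OK immediately.

```python
def prune_trust(trust, last_ack, now, delta):
    """Leader side: keep only processes that sent an ACK within the last 2*delta."""
    return {q for q in trust
            if q in last_ack and now - last_ack[q] <= 2 * delta}


def follower_step(q, leader_trust):
    """Follower side: adopt the leader's set, or start a new round if q is missing."""
    if q in leader_trust:
        return ("adopt", set(leader_trust))
    return ("start_next_round", None)
```

A process that the leader has wrongly dropped therefore forces a new round, so a wrong suspicion by the leader does not persist while that leader stays in charge.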
Theorem 6. Consider a system with message losses (lossy links). Assume some process
is eventually accessible. Figure 6 is an algorithm for ◇P. With this algorithm, there is
a time after which only n bidirectional links carry messages.
Abstract. We present a long-lived renaming algorithm in the read/write shared
memory model. Our algorithm is adaptive to the point contention k and works with
bounded memory and bounded values. We consider the renaming problem where
each process obtains the new name in the range 1, ..., k(2k − 1). In this paper, we
present an algorithm with O(k²) step complexity and O(n²N) space complexity,
where n and N are an upper bound of k and the number of processes, respectively.
The previous best result under the same problem setting is the algorithm with
O(k³) step complexity and O(n³N) space complexity presented by Afek et al. [?].
They also presented the algorithm with O(k² log k) step complexity and O(n³N)
space complexity under the condition where unbounded values are allowed. That
is, we improve the above two algorithms.



1 Introduction 

We consider a long-lived M-renaming problem in an asynchronous read/write shared
memory model. In the problem, every process repeatedly acquires a new name in the
range {1, ..., M} and releases it after use. The problem requires that no two
processes keep the same name concurrently. A renaming algorithm is very useful in
situations where the number of active processes is much smaller than the total number
of processes that may potentially participate. Since the complexities of
most distributed algorithms depend on the name space of processes, we can reduce the
complexities by reducing the name space using the renaming algorithm. For this purpose,
long-lived renaming algorithms are used in other long-lived applications such as atomic
snapshot [?], immediate snapshot [?], or collect [?].

Renaming algorithms are required to have a small name space and low complexity.
In the first renaming algorithms [?], the step complexities depend on the number N of
processes that may potentially participate in the algorithms, and each process
is allowed to acquire a new name only once (one-time renaming). An algorithm is said
to be fast if its step complexity depends not on N but on the upper bound n of the number
of processes that actually participate in the algorithm. One-time fast renaming
Table 1. Adaptive long-lived renaming algorithms.



algorithms lb] and long-lived fast renaming algorithms 17181 have been proposed. Fast 
renaming algorithms use the upper bound n in the algorithms and it needs to be known 
in advance. 

Recently, renaming algorithms whose step complexities depend only on the num-
ber of processes that actually participate in the algorithm were proposed. Such
algorithms are said to be adaptive. In adaptive algorithms, the number of actually ac-
tive processes is unknown in advance. Table 1 shows the results on adaptive long-lived
renaming algorithms. Attiya et al. [?] and Afek et al. [?] proposed one-time renaming
algorithms whose step complexities are functions of the interval contention, where the in-
terval contention is the number of active processes in the execution interval. Afek et
al. [?] first proposed adaptive long-lived renaming, where some algorithms are adaptive
to the interval contention and use bounded memory and others are adaptive to the point
contention and use unbounded memory. The point contention k is the maximum
number of processes executing concurrently at some point. Afek et al. [?] improved their
previous algorithms [?]. They proposed long-lived renaming algorithms adaptive to the
point contention. Their algorithms are a (2k² − k)-renaming algorithm with O(k² log k)
step complexity and O(n³N) space complexity using unbounded values, a (2k² − k)-
renaming algorithm with O(k³) step complexity and O(n³N) space complexity using
bounded values, and linear name-space renaming algorithms with exponential step
complexity and 0{n?N) space complexity. Attiya et al. fTTIl proposed point-contention- 
adaptive long-lived (2k — l)-renaming algorithm with 0{k'^) step complexity. 

In this paper, we present a point-contention-adaptive long-lived (2k² − k)-renaming
algorithm with O(k²) step complexity and O(n²N) space complexity using bounded
values. That is, our algorithm improves the previous two results of (2k² − k)-renaming
algorithms [?]. Our algorithm is based on the (2k² − k)-renaming algorithm with O(k³)
step complexity [?]. In that algorithm, every process which wants to get a new name
visits a series of sieves and tries to win in some sieve. In each sieve, a procedure
latticeAgreement is used to obtain a snapshot of participants in the sieve. We replace
this procedure with a simpler procedure, and can thereby reduce both the step complexity and
the space complexity. The replacement procedure is based on an adaptive collect algorithm
presented in [?].
This paper is organized as follows. We give some definitions in Section 2, and briefly
describe the previous (2k² − k)-renaming algorithm [?] in Section 3. Sections 4 and 5
show our efficient (2k² − k)-renaming algorithm and its correctness. Finally, we conclude
this paper in Section 6.



2 Definition 

Our computation model is an asynchronous read/write shared memory model [?]. A
shared memory model consists of a set of N asynchronous processes p_0, p_1, ..., p_{N−1}
and a set of registers shared by the processes. Each process has a unique identifier, and
we consider that the identifier of p_i is i. The processes communicate with each other only
through the registers, which provide two atomic operations, write and read. We assume
multi-writer multi-reader registers, that is, any process can write to and read from any
register.

The definitions of the long-lived renaming problem and the point contention are the same
as in the related works [?].

In the long-lived M-renaming problem, each process repeatedly acquires and releases
names in the range {1, ..., M}. A renaming algorithm provides two procedures
getName_i and releaseName_i for each process p_i. A process p_i uses the procedure
getName_i to get a new name, and uses the procedure releaseName_i to release it. Each
process p_i alternates between invoking getName_i and releaseName_i, starting with
getName_i.

An execution of an algorithm is a (possibly infinite) sequence of register operations
and invocations and returns of procedures where each process follows the algorithm. Let
α be an execution of a long-lived renaming algorithm, and let α' be some finite prefix of
α. Process p_i is active at the end of α' if α' includes an invocation of getName_i which
does not precede any return of releaseName_i in α'. Process p_i holds a name y at the
end of α' if the last invocation of getName_i returned y and p_i has been active ever since
the return. A long-lived renaming algorithm should guarantee the following uniqueness:
If active processes p_i and p_j (i ≠ j) hold names y_i and y_j, respectively, at the end of
some finite prefix of some execution, then y_i ≠ y_j.

The contention at the end of α', denoted Cont(α'), is the number of active processes
at the end of α'. Let β be a finite interval of α, that is, α = α1 β α2 for some α1 and
α2. The point contention during β, denoted PntCont(β), is the maximum contention
in prefixes α1β' of α1β.
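The definition can be made concrete with a small sketch that replays an execution. The event encoding is an assumption made for illustration: ("get", i) marks an invocation of getName_i and ("released", i) marks a return of releaseName_i; register operations are omitted because they do not change the set of active processes. The empty prefix of the interval is counted as well.

```python
def point_contention(alpha1, beta):
    """Maximum number of active processes over all prefixes alpha1 + beta'."""
    active = set()

    def apply(event):
        kind, pid = event
        if kind == "get":
            active.add(pid)          # process becomes active
        elif kind == "released":
            active.discard(pid)      # process is no longer active

    for event in alpha1:
        apply(event)
    best = len(active)               # contention at the start of the interval
    for event in beta:
        apply(event)
        best = max(best, len(active))
    return best
```

If process 0 is already active before the interval and processes 1 and 2 arrive during it, but process 0 releases its name before process 2 arrives, the point contention is 2 even though three processes were active at some time in the interval.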

A renaming algorithm has adaptive name space if there is a function f, such that the
name obtained in an interval β of getName is in the range {1, ..., f(PntCont(β))}.
Step complexity of a renaming algorithm is the worst case number of steps performed by
some p_i in an interval β of getName_i and in the following releaseName_i. A renaming
algorithm has adaptive step complexity if there is a bounded function S, such that the
number of steps performed by p_i in any interval β of getName_i and in the following
releaseName_i is at most S(PntCont(β)). Space complexity of a renaming algorithm
is the number of registers used in the algorithm.
3 O(k³)-Algorithm



Our renaming algorithm is based on the (2k² − k)-renaming algorithm presented in
[?], which is a long-lived renaming algorithm adaptive to the point contention k with
O(k³) step complexity using bounded memory and bounded values. We call this the O(k³)-
algorithm. To present our algorithm, we briefly describe the O(k³)-algorithm. Figure 1
shows the top level procedures getName and releaseName in the O(k²)-algorithm that
we will present later. In this paper, we show algorithms in Figs. 1, 2, 3, and 4, all of
which are pseudocode for process p_i.

The pseudocode in Fig. 1 is the same as the O(k³)-algorithm except that a pro-
cedure interleaved_sc_sieve returns a set of pairs <id, sp>. In the O(k³)-algorithm,
this procedure returns a set of identifiers that are the first components of <id, sp>. In
our O(k²)-algorithm, to reduce both step and space complexities, we use different data
structures from the O(k³)-algorithm inside lower level procedures mentioned later. The
second component sp of <id, sp> is a pointer to the data structure where some infor-
mation about a process p_id is stored. In the O(k³)-algorithm, the value id is sufficient
to point to the corresponding data structure, while our proposed algorithm needs a
pointer such as sp for efficient data space management. However, sp itself is used only in
the procedures called in the procedure releaseName, and is not used in the top level pro-
cedures. This means the two top level procedures getName and releaseName behave
in the same way in the O(k³)-algorithm and in our O(k²)-algorithm.

The differences between the two algorithms are the details of the procedures
interleaved_sc_sieve, leave and clear called in the top level procedures. Each
process gets a new name and releases it as follows.

getting a name: A process visits sieves 1, 2, ... until it wins in some sieve. Each
sieve has 2N copies, where one copy is work space for processes which visit the sieve
concurrently. A process p_i visiting a sieve s enters one copy and obtains a set of process
identifiers if it satisfies some conditions. If p_i gets a non-empty set W of process iden-
tifiers including its identifier i, p_i wins in the sieve. If p_i wins in s, p_i gets a name <s,
the rank of p_i in W>.

releasing a name: A process p_i leaves the copy from which p_i got a name to show
that p_i released the name.

Each sieve has 2N copies 0, 1, ..., 2N − 1. Each process uses a copy in a sieve s
designated by a variable sieve[s].count. The first component of the variable changes
as 0, 1, ..., 2N − 1, 0, 1, ... cyclically. We can associate a round with the value of
sieve[s].count, which means how many times the variable has been updated to the current
value. If a process sees sieve[s].count with a round r, we say that the process uses the
designated copy in the round r. The O(k³)-algorithm guarantees the following.

- The processes which enter the same copy in the same round get the identical non-
empty set of process identifiers or an empty set. Moreover, processes enter a copy
after all names assigned from the previous copy are released. These imply the unique-
ness of concurrently assigned names.

- If at least one process enters some copy in some round, at least one process wins.
This bounds the number of sieves each process visits.
Shared variables:
  sieve[1, ..., 2n − 1] {
    count : (integer, Boolean), initially (0, 0);
    status[0, ..., N − 1] : Boolean, initially false;
    // 2N copies; each copy consists of inside, allDone and list.
    inside[0, ..., 2N − 1] : Boolean, initially false;
    allDone[0, ..., 2N − 1] : Boolean, initially false;
    // list is only for O(k²)-algorithm.
    list[0, ..., 2N − 1] {
      mark[0, ..., n − 1] : Boolean, initially false;
      view[0, ..., n − 1] : set of (id, integer), initially ⊥;
      id[0, ..., n − 1] : id, initially ⊥;
      X[0, ..., n − 1] : integer, initially ⊥;
      Y[0, ..., n − 1] : Boolean, initially false;
      done[0, ..., n − 1] : Boolean, initially false;
  }}

Non-shared Global variables:
  nextC, c : integer, initially 0; nextDB, dirtyB : Boolean, initially 0;
  sp : integer, initially ⊥;
  W : view (set of (id, integer)), initially ∅; s : integer, initially 0;

procedure getName()
 1  s = 0;
 2  while (true) do
 3    s++;
 4    sieve[s].status[i] = active;
 5    (c, dirtyB) = sieve[s].count;
 6    nextC = c + 1 mod 2N;
 7    if (nextC = 0) then nextDB = not dirtyB;
 8    else nextDB = dirtyB;
 9    if ((nextC mod N = i) or (sieve[s].status[nextC mod N] = idle)) then
10      W = interleaved_sc_sieve(sieve[s], nextC, nextDB);
11      if ((i, sp) ∈ W for some sp) then
12        sieve[s].count = (nextC, nextDB);
13        return (s, rank of i in W);
14      else-if (sieve[s].allDone[nextC] = nextDB) then clear(sieve, nextC);
15    sieve[s].status[i] = idle;
16  od;

procedure releaseName()
17  leave(sieve[s], nextC, nextDB);
18  if (sieve[s].allDone[nextC] = nextDB) then
19    clear(sieve, nextC);
20  sieve[s].status[i] = idle;

Fig. 1. Outline of O(k³)-algorithm and O(k²)-algorithm.
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- When a process pi is the inside of a sieve s, no other process can enter the copies i 

and i + N . This puts bounds to the number of copies used concurrently in one sieve, 

and guarantees that no copy is used concurrently in the different rounds. 

Each copy is initialized so that it can be reused in the next round. When a process $p_i$
leaves copy c of a sieve s (lines 14 and 18), $p_i$ initializes c if all the names assigned
from c have been released. The Boolean variable sieve[s].allDone[c] signals that a round
in the copy has finished. The value nextDB alternates with the parity of the round
(lines 7 and 8).
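
The cyclic copy index and the alternating bit of lines 5 to 8 can be sketched as follows. This is only an illustration in Python: the function name `next_copy` is hypothetical, and the loop assumes that count is advanced on every step, whereas in the algorithm it is written only by a winner (line 12).

```python
def next_copy(count, N):
    """Return (nextC, nextDB) for a sieve whose count field is (c, dirtyB)."""
    c, dirty_b = count
    next_c = (c + 1) % (2 * N)  # line 6: the 2N copies are used cyclically
    # lines 7-8: the bit flips exactly when the copy index wraps around to 0
    next_db = (not dirty_b) if next_c == 0 else dirty_b
    return next_c, next_db


N = 2  # illustrative number of processes, so each sieve has 2N = 4 copies
count = (0, False)
trace = []
for _ in range(2 * 2 * N):  # two full passes over the copies
    count = next_copy(count, N)
    trace.append(count)
print(trace)
```

The bit stays constant within one pass over the 2N copies and flips at the wrap-around, which is how a copy can tell the current round from the previous one without an unbounded counter.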

In the $O(k^3)$-algorithm, each process wins in some sieve among the first $2k - 1$
sieves, and enters at most one copy in each sieve. In each copy, processes try to
obtain a non-empty set of process identifiers by using the procedures latticeAgreement
and candidates, where latticeAgreement returns a snapshot of the participants of the
procedure and candidates returns the minimum snapshot among the snapshots obtained
by latticeAgreement. In each copy, each process executes latticeAgreement once with
$O(k \log k)$ steps, and executes the procedure clear at most once to initialize the copy
with $O(k^2)$ steps. Since a process visits at most $2k - 1$ sieves, the step complexity
of the $O(k^3)$-algorithm is $O(k^3)$.

The space complexity is as follows. The algorithm uses $2n - 1$ sieves, each of which
has $2N$ copies. Each copy has $O(n^2)$ registers for latticeAgreement, which dominates
the space complexity of a copy. Therefore, the total space complexity is $O(n^3 N)$. Since
each process gets a new name (sieve, its rank in W), where W is the non-empty set
obtained in the sieve where the process wins, the name space is $(2k - 1)k = 2k^2 - k$.
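
A quick count confirms the name-space arithmetic. The sketch below is illustrative only: it assumes ranks are numbered 1 through k within a winning set, and `name_space` is a hypothetical helper, not part of the algorithm.

```python
def name_space(k):
    """All (sieve, rank) pairs available under point contention k."""
    # A process wins in one of the first 2k - 1 sieves, and its rank in the
    # winning set W is assumed here to lie between 1 and k.
    return [(s, r) for s in range(1, 2 * k) for r in range(1, k + 1)]


for k in (1, 2, 3, 10):
    assert len(name_space(k)) == (2 * k - 1) * k == 2 * k * k - k
```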

4 $O(k^2)$-Algorithm

We replace the procedures latticeAgreement and candidates of the $O(k^3)$-algorithm.
In particular, we replace latticeAgreement with a simple procedure, and achieve $O(k^2)$
step complexity and $O(n^2 N)$ space complexity while using bounded values. We call
our algorithm the $O(k^2)$-algorithm. Figures 2 and 3 show the procedures used in the
$O(k^2)$-algorithm.

In the $O(k^3)$-algorithm, processes which concurrently enter the same copy obtain the
identical set W if they obtain a non-empty set. To achieve this, the procedures
latticeAgreement and candidates are used. Each process entering a copy invokes
latticeAgreement to capture a snapshot (i.e., the values at some point) of the processes
which have entered the same copy in the same round. Then the process invokes candidates
to find the minimum snapshot among the snapshots obtained by latticeAgreement. The
procedure candidates returns the minimum snapshot only when the minimum snapshot can
be identified, where “minimum” means the minimum among the snapshots obtained in
the same copy in the same round, including snapshots obtained by other processes later.
Since the minimum snapshot is unique, some processes obtain the identical non-empty
set W. If k processes execute latticeAgreement concurrently, $O(k^2)$ registers
are used. To initialize these registers, some process takes $O(k^2)$ steps in one sieve. This
dominates the step complexity of the $O(k^3)$-algorithm.

In our $O(k^2)$-algorithm, some processes obtain a snapshot of the processes registered
in the copy (procedure partial_scan). This can be achieved by invoking collect twice.
Then, processes find the minimum snapshot by invoking candidates if the minimum
snapshot can be identified. Though our algorithm is simple, it guarantees that at least
one process obtains the identical non-empty set. Moreover, our procedures use $O(k)$
registers if k processes enter the same copy in the same round. Therefore, $O(k)$ steps
are sufficient to initialize the copy. This reduces the step complexity.

We explain the procedure interleaved_sc_sieve (Fig. 2). A process enters a copy c if
all the names assigned from the previous copy are released and the current copy is free
(line 30). The process can notice that all the names assigned from the previous copy
($c - 1 \bmod 2N$) are released by checking the variable sieve.allDone[c - 1 mod 2N]. The
Boolean variable sieve.allDone[c] changes after all names assigned from a copy c in a
sieve sieve are released.

If a process obtains a non-empty set W of process identifiers, W is the set of candidates
for winners in this sieve. After all the candidates have left this copy, the copy is
initialized to be reused in the next round (clear). However, some slow process not in W
may still be working in the copy after the initialization has started. Therefore, after every
operation on shared registers, each process checks whether the copy has been finished. If
the process notices that the copy has been finished, it initializes the last modified register
and leaves the sieve. This mechanism is implemented by interleave (line 22), which is
the same as in the $O(k^3)$-algorithm.

A process entering a copy c scans c. To scan a copy, our algorithm uses the
total-contention-adaptive collect and register with $O(k)$ step complexity presented in [3],
where the total contention is the number of active processes in an execution of the
algorithm. In the $O(k^2)$-algorithm, these two procedures are executed only by the processes
which enter the same copy concurrently. They all read false from sieve.inside[nextC]
and then write true to it (lines 30 and 31). This means they are concurrently active at
a point immediately after they all read the value false from sieve.inside[nextC] and
before any process writes the value true to it. Therefore, we can use these procedures
as point-contention-adaptive procedures.

To scan a copy c, processes first try to register to c. In the collect presented in [3],
every process can register to a collect tree. The collect tree is a binary tree where each
node is a splitter presented in [6]. A process which enters a splitter exits with either stop,
left or right. It guarantees that, among $\ell$ processes which enter the same splitter, (1) at
most one process obtains stop, (2) at most $\ell - 1$ processes obtain left, and (3) at most
$\ell - 1$ processes obtain right. Since it is enough to obtain a non-empty set of process
identifiers, we do not need all processes to register. Therefore, we simplify the register
presented in [3]. We use only one direction of a splitter, and use a collect list instead of
the collect tree. The modified splitter returns stop, next or abort instead of stop, left or
right, respectively. The property of the splitter implies that (1) at least one process
registers at some node in a collect list, and (2) at most one process registers at each
splitter. Figure 4 shows the procedures register, collect and splitter, and Figure 5 shows
a collect list.
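
The behaviour of the modified splitter under different interleavings can be explored with a small simulation. The sketch below is an assumption-laden illustration, not the authors' code: shared registers are plain Python lists, each process is a generator that pauses after every shared access, and `CollectList` and `run` are hypothetical helpers introduced only for this example.

```python
BOTTOM = None  # stands for the initial value "⊥"


class CollectList:
    """Shared registers of one collect list with n splitters (hypothetical container)."""

    def __init__(self, n):
        self.mark = [False] * n
        self.id = [BOTTOM] * n
        self.X = [BOTTOM] * n
        self.Y = [False] * n


def splitter(lst, sp, i):
    """Modified splitter; yields after each shared-memory access."""
    lst.X[sp] = i
    yield
    y = lst.Y[sp]
    yield
    if y:
        return "abort"
    lst.Y[sp] = True
    yield
    x = lst.X[sp]
    yield
    return "stop" if x == i else "next"


def register(lst, i):
    """Walk the list until process i stops at a splitter or aborts."""
    sp = 0
    while True:
        lst.mark[sp] = True
        yield
        move = yield from splitter(lst, sp, i)
        if move == "next":
            sp += 1
        elif move == "abort":
            return BOTTOM
        else:  # "stop": process i owns splitter sp
            lst.id[sp] = i
            yield
            return sp


def run(schedule, n=4, pids=(1, 2)):
    """Run register for each pid, one shared access per schedule entry."""
    lst = CollectList(n)
    procs = {pid: register(lst, pid) for pid in pids}
    results = {}
    for pid in schedule:
        if pid in results:
            continue  # this process has already returned
        try:
            next(procs[pid])
        except StopIteration as done:
            results[pid] = done.value
    return results, lst.id


lockstep, ids_lockstep = run([1, 2] * 20)              # strict alternation
sequential, ids_sequential = run([1] * 20 + [2] * 20)  # one after the other
print(lockstep, ids_lockstep)
print(sequential, ids_sequential)
```

In the alternating schedule both processes register, at different splitters; in the sequential schedule the second process aborts. Both outcomes are consistent with the two properties stated above: someone registers, and no splitter holds more than one identifier.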

Non-shared Global variables :

  last-modified : points to last shared variable modified by p_i;
  // last-modified is assumed to be updated
  // immediately before the write.
  mysplitter : integer, initially ⊥;

procedure interleaved_sc_sieve(sieve, nextC, nextDB)
  // interleave is a two part construct.
  // Part I of the interleave is executed
  // after every read or write to a shared variable in Part II,
  // the SC_sieve() and any procedure recursively called from SC_sieve().
21  last-modified = ⊥;
22  interleave { // Part I
23    if (sieve.allDone[nextC] = nextDB) then
24      if (last-modified ≠ ⊥) then
25        write initial value to last-modified;
26      return ∅; // abort current SC_sieve(), s, and continue to next sieve.
27  } { // Part II
28    return sc_sieve(sieve, nextC, nextDB);
29  }

procedure sc_sieve(sieve, nextC, nextDB)
30  if (previousFinish(sieve, nextC, nextDB) and sieve.inside[nextC] = false) then
31    sieve.inside[nextC] = true;
32    mysplitter = register(sieve.list[nextC]);
33    if (mysplitter ≠ ⊥) then
34      sieve.list[nextC].view[mysplitter] = partial_scan(sieve.list[nextC]);
35      W = candidates(sieve, nextC);
36      if ((p_i, mysplitter) ∈ W) then return W;
37      sieve.list[nextC].done[mysplitter] = true;
38      W = candidates(sieve, nextC);
39      leave(sieve, nextC, nextDB);
40  return ∅;

procedure previousFinish(sieve, nextC, nextDB)
41  if (nextC ≠ 0 and sieve.allDone[nextC - 1 mod 2N] = nextDB)
42    then return true;
43  if (nextC = 0 and sieve.allDone[2N - 1] ≠ nextDB)
44    then return true;
45  return false;

Fig. 2. Procedures of the $O(k^2)$-algorithm: part I.

procedure partial_scan(list)
46  V1 = collect(list);
47  V2 = collect(list);
48  if (V1 = V2) then return V1;
49  else return ∅;

procedure candidates(sieve, copy)
50  sp = 0; V = ∅;
51  while (sieve.list[copy].mark[sp] = true) do
52    if (sieve.list[copy].view[sp] ≠ ⊥) then
53      V = V ∪ {sieve.list[copy].view[sp]};
54    sp++;
55  od;
56  if V = ∅ then return ∅;
57  U = min{view | view ∈ V and view ≠ ∅};
58  if U ≠ ∅ and for every (j, sp) ∈ U, sieve.list[copy].view[sp] ⊇ U
        or sieve.list[copy].view[sp] = ∅ then
59    return U;
60  else
61    return ∅;

procedure clear(sieve, nextC)
62  sieve.inside[nextC] = false;
63  sp = 0;
64  while (sieve.list[nextC].mark[sp] = true) do
65    write initial value to a splitter sp in sieve.list[nextC];
66    sp++;
67  od;

procedure leave(sieve, nextC, nextDB)
68  sieve.list[nextC].done[mysplitter] = true;
69  if W ≠ ∅ and for every (j, sp) ∈ W, sieve.list[nextC].done[sp] = true then
70    sieve.allDone[nextC] = nextDB;

Fig. 3. Procedures of the $O(k^2)$-algorithm: part II.

In the procedure collect, a process obtains a set of registered process identifiers.
We call such a set a view. Our collect is the same as the collect of [3] except that we
search in the list. In collect, a process simply searches the list from its root until it reaches
an unmarked splitter. The collect returns a view consisting of all process identifiers
which registered before the collect was invoked and some process identifiers which register
concurrently with the execution of collect. Let V be a view obtained by an execution
of collect, and let $V_1$ and $V_2$ be the sets of process identifiers which have registered
immediately before and after the execution, respectively. The register and collect guarantee
$V_1 \subseteq V \subseteq V_2$. This implies that if a process obtains the identical set by two
consecutive invocations of collect, the set is a snapshot of the processes which had
registered at some point between the two collects. The procedure candidates returns the
minimum snapshot. If a process obtains a view including its identifier from candidates,
the process becomes a winner.
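
The double-collect idea behind partial_scan can be sketched in a few lines. This is a sequential illustration under stated assumptions: `read_list` is a hypothetical callable returning the current (id, mark) arrays, used here to imitate a list that may change between the two collects, and the bound check on the list length is added for safety and does not appear in Fig. 4.

```python
def collect(ids, marks):
    """Walk the list from the root until the first unmarked splitter."""
    view = set()
    sp = 0
    while sp < len(marks) and marks[sp]:
        if ids[sp] is not None:  # None plays the role of "⊥"
            view.add((ids[sp], sp))
        sp += 1
    return frozenset(view)


def partial_scan(read_list):
    """Return a snapshot if two consecutive collects agree, else the empty set."""
    v1 = collect(*read_list())
    v2 = collect(*read_list())
    return v1 if v1 == v2 else frozenset()


def stable():
    # The list does not change between the two collects.
    return lambda: ([7, None, 9, None], [True, True, True, False])


def changing():
    # Process 8 registers between the first and the second collect.
    states = iter([([7, None], [True, False]), ([7, 8], [True, True])])
    return lambda: next(states)


print(partial_scan(stable()))
print(partial_scan(changing()))
```

When the two views agree, the sandwich property $V_1 \subseteq V \subseteq V_2$ pins the view to the set of processes registered at some instant between the collects; when they differ, the process simply gives up on obtaining a snapshot.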



5 Correctness of the $O(k^2)$-Algorithm

We briefly show the correctness of our algorithm. Since the outline of the
$O(k^2)$-algorithm is the same as that of the $O(k^3)$-algorithm, it is enough to show that
the procedures partial_scan and candidates in the $O(k^2)$-algorithm work correctly in place
of latticeAgreement and candidates in the $O(k^3)$-algorithm. We show that processes which
enter the same copy in the same round obtain either the identical non-empty set of process
identifiers or an empty set, and that if at least one process enters some copy in some round,
at least one process wins (obtains a set including its identifier).

procedure register(list)
71  sp = 0;
72  while (true)
73    list.mark[sp] = true;
74    move = splitter(list, sp);
75    if (move = next) then
76      sp++;
77    if (move = abort) then
78      return ⊥;
79    if (move = stop) then
80      list.id[sp] = i;
81      return sp;
82  od;

procedure collect(list)
83  sp = 0; V = ∅;
84  while (list.mark[sp] = true)
85    if (list.id[sp] ≠ ⊥) then V = V ∪ {(list.id[sp], sp)}
86    sp++;
87  od;
88  return V;

procedure splitter(list, currentsp)
89  list.X[currentsp] = i;
90  if (list.Y[currentsp] = true) then return abort;
91  list.Y[currentsp] = true;
92  if (list.X[currentsp] = i) then return stop;
93  else return next;

Fig. 4. Collect.

We show some basic properties of partial_scan and candidates. By the same
access control as in the $O(k^3)$-algorithm, our $O(k^2)$-algorithm guarantees that all
processes entering a copy in some round leave, and the copy is initialized, before the copy is
used in the next round. Therefore, the behavior of a copy in some round is independent of
the behavior of the previous rounds in the copy. The following lemmas concern a copy
in one round.



Fig. 5. Collect list.

Lemma 1. If $W_1$ and $W_2$ are non-empty sets returned by invocations of candidates for
the same copy c in the same round of the same sieve s, then $W_1 = W_2$.

Proof. We prove this lemma by contradiction. Assume $W_1 \neq W_2$. Since $W_1$ and $W_2$ are
snapshots of processes which have registered, $W_1 \subset W_2$ or $W_2 \subset W_1$ holds. We assume
$W_1 \subset W_2$ without loss of generality. A snapshot returned by candidates is a snapshot
obtained by some process using partial_scan in the same copy in the same round. Let
$p_j$ be a process which obtains the snapshot $W_1$ by partial_scan, and $p_m$ be a process
which obtains $W_2$ by candidates. Since $p_j$ searches the collect list after it registered at
some splitter sp, $p_j \in W_1$ holds. Since $p_j$ is the only process which updates the variable
s.list[c].view[sp] in this round, the value of s.list[c].view[sp] must be the initial value
⊥ or $W_1$. However, since $p_j \in W_1 \subset W_2$, $p_m$ sees that s.list[c].view[sp] ⊇ $W_2$ or
s.list[c].view[sp] = ∅, which neither ⊥ nor the non-empty proper subset $W_1$ satisfies.
A contradiction.

By Lemma 1 and the fact that each copy is used only after all the names assigned from
the previous copy are released, we can show the following uniqueness.

Lemma 2. If active processes $P_i$ and $P_j$ ($i \neq j$) hold names $p_i$ and $p_j$, respectively, at
the end of some finite prefix of some execution, then $p_i \neq p_j$.

The following lemmas are used to bound the number of sieves that each process
visits.

Lemma 3. If one or more processes enter a copy c of a sieve s in some round, at least
one process obtains a snapshot by partial_scan in c in this round.

Proof. We prove the lemma by contradiction. Assume that no process obtains a snapshot.
In this case, no process writes a non-empty set to a variable s.list[c].view[sp] for any
splitter sp, no process obtains a non-empty set by candidates, and no process writes nextDB
to the variable s.allDone[nextC]. Since a copy is initialized only after s.allDone[nextC] is
set to nextDB, no process initializes the copy. Let $p_i$ be the last process which writes its
identifier to a variable s.list[c].id[sp] of some splitter sp in the copy c in register. The
process $p_i$ then executes collect twice in partial_scan. Since the set of processes which
have registered does not change after $p_i$ registered, $p_i$ obtains a snapshot. A contradiction.



Lemma 4. If one or more processes enter a copy c of a sieve s in some round, at least 
one process wins in c in this round. 



Proof. Lemma 3 shows that at least one process obtains a snapshot by partial_scan in c
in this round. Let W be the minimum snapshot obtained in c in this round, and $p_i$ be the
last process in W which writes a value to s.list[c].view[sp] in some splitter sp. Since W
is the minimum, every process $p_j$ in W either obtains a snapshot not smaller than W or fails
to obtain a snapshot; that is, $p_j$ writes a view $W'$ in its splitter such that $W \subseteq W'$ or
$W' = \emptyset$. The process $p_i$ can see these values in candidates and return W including $p_i$.
That is, $p_i$ wins in c in this round.

Lemma 4 above is used to show lemmas similar to Lemmas 3.1 and 3.4 in |H|,
and we can show the following.

Lemma 5. Every process p wins in a sieve numbered at most $2k - 1$, where k is the point
contention of p’s interval of getName.

The step complexity of the $O(k^2)$-algorithm is as follows. In getName, each process
$p_i$ visits at most $2k - 1$ sieves, and accesses and enters at most one copy in each
sieve. For each copy, $p_i$ invokes one register, two collects, and at most one clear. Each
procedure has $O(k)$ step complexity, and therefore the total step complexity is $O(k^2)$. The
algorithm uses $2n - 1$ sieves, $2N$ copies of each sieve, and $O(n)$ registers for each copy.
Therefore, the space complexity is $O(n^2 N)$.

Theorem 1. The $O(k^2)$-algorithm solves the point contention adaptive long-lived
$(2k^2 - k)$-renaming problem with $O(k^2)$ step complexity and $O(n^2 N)$ space complexity
using bounded values.



6 Conclusions 

We presented a long-lived $(2k^2 - k)$-renaming algorithm which is adaptive to point
contention k and uses bounded memory and bounded values in the read/write shared
memory model. The step complexity is $O(k^2)$ and the space complexity is $O(n^2 N)$,
where n is an upper bound on k and N is the number of processes.

One of the open problems is whether we can develop an adaptive long-lived renaming
algorithm where the number of registers used depends not on N but on n.



Abstract. We have a new proof of the lower bound that $k$-set agreement
requires $\lfloor f/k \rfloor + 1$ rounds in a synchronous, message-passing model
with $f$ crash failures. The proof involves constructing the set of reachable
states, proving that these states are highly connected, and then appealing
to a well-known topological result that high connectivity implies that
set agreement is impossible. We construct the set of reachable states in
an iterative fashion using a round operator that we define, and our proof
of connectivity is an inductive proof based on this iterative construction
and using simple properties of the round operator. This is the shortest
and simplest proof of this lower bound we have seen.



1 Introduction 

The consensus problem has received a great deal of attention. In this problem,
$n + 1$ processors begin with input values, and all must agree on one of
these values as their output value. Fischer, Lynch, and Paterson surprised
the world by showing that solving consensus is impossible in an asynchronous
system if one processor is allowed to fail. This leads one to wonder if there is any
way to weaken consensus to obtain a problem that can be solved in the presence
of $k - 1$ failures but not in the presence of $k$ failures. Chaudhuri defined
the $k$-set agreement problem and conjectured that this was one such problem,
and a trio of papers proved that she was right. The $k$-set agreement
problem is like consensus, but we relax the requirement that processors agree:
the set of output values chosen by the processors may contain as many as $k$
distinct values, and not just 1. Consensus and set agreement are just as interesting
in synchronous models as they are in asynchronous models. In synchronous
models, it is well-known that consensus requires $f + 1$ rounds of communication
if $f$ processors can crash, and that $k$-set agreement requires $\lfloor f/k \rfloor + 1$
rounds. These lower bounds agree when $k = 1$ since consensus is just 1-set
agreement. In this paper, we give a new proof of the $\lfloor f/k \rfloor + 1$ lower bound for
set agreement in the synchronous message-passing model with crash failures.
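
For concreteness, the bound can be evaluated for a few parameter choices. The helper below is a hypothetical illustration of the formula $\lfloor f/k \rfloor + 1$ only; it says nothing about how the bound is proved.

```python
def round_lower_bound(f, k):
    """Rounds required for k-set agreement with f crash failures: floor(f/k) + 1."""
    if k < 1 or f < 0:
        raise ValueError("need k >= 1 and f >= 0")
    return f // k + 1


# With k = 1 the bound coincides with the f + 1 bound for consensus.
for f in range(6):
    assert round_lower_bound(f, 1) == f + 1

print(round_lower_bound(5, 2))  # floor(5/2) + 1
```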

All known proofs for the set agreement lower bound depend, either explicitly
or implicitly, on a deep connection between computation and topology.
These proofs essentially consider the simplicial complex representing all possible
reachable states of a set agreement protocol, and then argue about the
connectivity of this complex. These lower bounds for set agreement follow from
the observation that set agreement cannot be solved if the complex of reachable
states is sufficiently highly-connected. This connection between connectivity and
set agreement has been established both in a generic way and in ways specialized
to particular models of computation [2,4,7,13,14,15,21]. Once the connection
has been established, however, the problem reduces to reasoning about
the connectivity of a protocol’s reachable complex.

J. Welch (Ed.): DISC 2001, LNCS 2180, 2001.

Most of the prior work employing topological arguments has focused on the 
asynchronous model of computation, in which processors can run at arbitrary 
speeds, and fail undetectably. Reasoning about connectivity in the asynchronous 
model is simplified by the fact that the connectivity of the reachable complex 
remains unchanged over time. Moreover, the extreme flexibility of the processor 
failure model facilitates the use of invariance arguments to prove connectivity. 
In the synchronous model that we consider here, analyzing connectivity is sig- 
nificantly more complicated. The difficulty arises because the connectivity of the 
reachable complex changes from round to round, so the relatively simple invari- 
ance arguments used in the asynchronous model cannot possibly work here. 

The primary contribution of this work is a new, substantially simpler proof 
of how the connectivity of the synchronous complex evolves over time. Our proof 
depends on two key insights: 



1. The notion of a round operator that maps a global state to the set of global 
states reachable from this state by one round of computation, an operator 
satisfying a few simple algebraic properties. 

2. The notion of an absorbing poset organizing the set of global states into 
a partial order, from which the connectivity proof follows easily using the 
round operator’s algebraic properties. 

We believe this new proof has several novel and elegant features. First, we are 
able to isolate a small set of elementary combinatorial properties of the round 
operator that suffice to establish the connection with classical topology. Second, 
these properties require only local reasoning about how the computation evolves 
from one round to the next. Finally, most connectivity arguments can be difficult 
to follow because they mix semantic, combinatorial, and topological arguments, 
but those arguments are cleanly separated here: The definition of the round 
operator captures the semantics of the synchronous model, the reasoning about 
the round operator is purely combinatorial, and the lower bound is completed 
with a “black box” application of well-known topological results without any 
need to make additional topological arguments. 

In the next section, we give an overview of our proof strategy and discuss its 
relationship to other proofs appearing in the literature. In the main body of the 
paper, we sketch the proof itself. The full proof in the full paper fills just over 
a dozen pages, making it the shortest self-contained proof of this lower bound 
that we have seen.

Fig. 1. A global state $S$ and the set $\mathcal{R}_1(S)$ of global states after one round from $S$.



2 Overview 

We assume a standard synchronous message-passing model with crash failures.
The system has $n + 1$ processors, and at most $f$ of them can crash in any
given execution. Each processor begins in an initial state consisting of its input
value, and computation proceeds in a sequence of rounds. In each round, each 
processor sends messages to other processors, receives messages sent to it by the 
other processors in that round, performs some internal computation, and changes 
state. We assume that processors are following a full-information protocol, which 
means that each processor sends its entire local state to every processor in every 
round. This is a standard assumption to make when proving lower bounds. A 
processor can fail by crashing in the middle of a round, in which case it sends its 
state only to a subset of the processors in that round. Once a processor crashes, 
it never sends another message after that. 

We represent the local state of a processor with a vertex labeled with that 
processor’s id and its local state. We represent a global state as a set of labeled 
vertexes, labeled with distinct processors, representing the local state of each 
processor in that global state. In topology, a simplex is a set of vertexes, and a 
complex is a set of simplexes that is closed under containment. Applications of 
topology to distributed computing often assume that these vertexes are points 
in space and that the simplex is the convex hull of these points in order to be 
able to use standard topology results. As you read this paper, you might find it 
helpful to think of simplexes in this way, but in the purely combinatorial work 
done in this paper, a simplex is just a set of vertexes. 

As an example, consider the simplex and complex illustrated in Figure 1.
On the left side, we see a simplex representing an initial global state in which
processors P, Q, and R start with input values 0, 2, and 1. Each vertex is labeled
with a processor’s id and its local state (which is just its input value in this case).
On the right we see a complex representing the set of states that arise after one
round of computation from this initial state if one processor is allowed to crash.
The labeling of the vertexes is represented schematically by a processor id such
as P and a string of processor ids such as PQ. The string PQ is intended to
represent the fact that P heard from processors P and Q during the round but
not from R, since R failed that round. (We are omitting input values on the right
for notational simplicity.) The simplexes that represent states after one round
are the 2-dimensional triangle in the center and the 1-dimensional edges that
radiate from the triangle (including the edges of the triangle itself). The central
triangle represents the state after a round in which no processor fails. Each edge
represents a state after one processor failed. For example, the edge with vertexes
labeled P; PQR and Q; PQ represents the global state after a round in which R
fails by sending a message to P and not sending to Q: P heard from all three
processors, but Q did not hear from R.
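
The example edge can be reproduced by simulating a single round operationally. The sketch below is an illustration under simplifying assumptions: at most one processor crashes, views record only which processors were heard from (input values are omitted, as on the right of the figure), and `one_round` is a hypothetical helper.

```python
def one_round(processors, crashed=None, delivered_to=()):
    """Views of the surviving processors after one full-information round.

    crashed is the processor that crashes during the round (or None), and
    delivered_to is the set of recipients that still received its message.
    """
    views = {}
    for p in processors:
        if p == crashed:
            continue  # a crashed processor has no vertex in the resulting state
        heard = {q for q in processors if q != crashed}
        if crashed is not None and p in delivered_to:
            heard.add(crashed)
        views[p] = "".join(sorted(heard))
    return views


print(one_round("PQR"))                                   # failure-free round
print(one_round("PQR", crashed="R", delivered_to={"P"}))  # the edge P;PQR and Q;PQ
```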

What we do in this paper is define round operators like the round operator $\mathcal{R}_1$
that maps the simplex $S$ on the left of Figure 1 to the complex $\mathcal{R}_1(S)$ on the
right, and then argue about the connectivity of $\mathcal{R}_1(S)$. Informally, connectivity
in dimension 0 is just ordinary graph connectivity, and connectivity in higher
dimensions means that there are no “holes” of that dimension in the complex.
When we reason about connectivity, we often talk about the connectivity of a
simplex $S$ when we really mean the connectivity of the induced complex consisting
of $S$ and all of its faces. For example, both of the complexes in Figure 1
are 0-connected since they are connected in the graph theoretic sense. In fact,
the complex on the left is also 1-connected, but the complex on the right is not
since there are “holes” formed by the three cycles of 1-dimensional edges. The
fundamental connection between $k$-set agreement and connectivity is that $k$-set
agreement cannot be solved after $r$ rounds of computation if the complex of
states reachable after $r$ rounds of computation is $(k - 1)$-connected. In the
remainder of this overview, we sketch how we define a round operator, and how
we reason about the connectivity of the complex of reachable states.



2.1 Round Operators 

In the synchronous model, we can represent a round of computation with a round
operator that maps the state $S$ at the start of a round to the set $\mathcal{R}_\ell(S)$ of
all possible states at the end of a round in which at most $\ell$ processors fail.
Suppose $F$ is the set of processors that fail in a round, and consider the local
state of a processor $p$ at the end of that round. The full-information protocol
has each processor send its local state to $p$, so $p$ receives the local state of each
processor, with the possible exception of some processors in $F$ that fail before
sending to $p$. Since each processor $q$ sending to $p$ sends its local state, and since
this local state labels $q$’s vertex in $S$, we can view $p$’s local state at the end of the
round as the face of $S$ containing the local states $p$ received from processors like $q$.
If we define $S/F$ to be the face of $S$ obtained by deleting the vertexes of $S$ labeled
with processors in $F$, then $p$ receives at least the local states labeling $S/F$, so $p$’s
local state after the round of computation can be represented by some face of $S$
containing $S/F$.

This intuition leads us to define the round operator $\mathcal{R}_\ell$ as follows. For each
set $F$ of at most $\ell$ processors labeling a state $S$, define $\mathcal{R}_F(S)$ to be the set of
simplexes obtained by labeling each vertex of $S/F$ with some face of $S$ containing
$S/F$. This is the set of possible states after a round of computation from $S$
in which $F$ is the set of processors that fail, since the processors labeling $S/F$
are the processors that are still alive at the end of the round, and since they
each hear from some set of processors that contains the processors labeling $S/F$.
The round operator $\mathcal{R}_\ell(S)$ is defined to be the union of all $\mathcal{R}_F(S)$ such that $F$
is a set of at most $\ell$ processors labeling $S$.
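
The informal definition can be turned into a small enumeration. The following sketch makes simplifying assumptions: a state is identified with its set of processor ids (local-state labels are dropped), a vertex of the result is a pair (processor, face heard from), and only the generating simplexes are listed rather than their closure under faces. The function names are hypothetical.

```python
from itertools import combinations, product


def faces_containing(S, base):
    """All faces of S (as frozensets) that contain the face base."""
    extra = sorted(S - base)
    return [base | frozenset(c)
            for r in range(len(extra) + 1)
            for c in combinations(extra, r)]


def round_F(S, F):
    """Simplexes obtained by labeling each vertex of S/F with a face of S containing S/F."""
    alive = sorted(S - F)
    base = frozenset(alive)
    choices = faces_containing(S, base)
    return {frozenset(zip(alive, labels))
            for labels in product(choices, repeat=len(alive))}


def round_op(S, ell):
    """Union of round_F(S, F) over all failure sets F with at most ell processors."""
    S = frozenset(S)
    out = set()
    for r in range(ell + 1):
        for F in combinations(sorted(S), r):
            out |= round_F(S, frozenset(F))
    return out


states = round_op("PQR", 1)
vertices = {v for simplex in states for v in simplex}
print(len(states), len(vertices))
```

For three processors and at most one failure this produces the central triangle plus four edges for each possible failed processor, over nine distinct vertices, which matches the schematic picture described for the right-hand side of Figure 1.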

To illustrate this informal definition, consider the complex $\mathcal{R}_1(S)$ of global
states on the right of Figure 1. This complex is the union of four rather degenerate
pseudospheres, which are complexes defined in Section 5.1 that are
topologically similar to a sphere. The first pseudosphere is the central triangle.
This is the pseudosphere $\mathcal{R}_{\emptyset}(S)$, where each processor hears from all other
processors, so each processor’s local state at the end of the round is the complete
face $\{P, Q, R\}$ of $S$. The other three pseudospheres are the cycles hanging off
the central triangle. These are the pseudospheres of the form $\mathcal{R}_{\{P\}}(S)$, where
each processor hears from all processors with the possible exception of $P$, so
each processor’s local state at the end of the round is either the whole simplex
$S = \{P, Q, R\}$ or the face $S/\{P\} = \{Q, R\}$, depending on whether the
processor did or did not hear from $P$.

If $\mathcal{R}_\ell(S)$ is the set of possible states after one round of computation, then
$\mathcal{R}_\ell^r(S) = \mathcal{R}_\ell \mathcal{R}_\ell^{r-1}(S)$ is the set of possible states after $r$ rounds of computation.
The goal of this paper is to prove that $\mathcal{R}_\ell^r(S)$ is highly-connected.

2.2 Absorbing Posets 

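To illustrate the challenge of proving that $\mathcal{R}_\ell^r(S)$ is connected, let us assume
that $\mathcal{R}_\ell(S)$ is $\ell$-connected for every $S$ and $\ell$, and let us prove that $\mathcal{R}_\ell \mathcal{R}_\ell(S)$
is $\ell$-connected. If $\mathcal{R}_\ell(S) = \{S_1, \ldots, S_k\}$ is the set of states after one round, then

$$\begin{aligned}
\mathcal{R}_\ell \mathcal{R}_\ell(S) &= \mathcal{R}_\ell(S_1 \cup S_2 \cup \cdots \cup S_k) \\
&= \mathcal{R}_\ell(S_1) \cup \mathcal{R}_\ell(S_2) \cup \cdots \cup \mathcal{R}_\ell(S_k)
\end{aligned}$$

is the set of states after two rounds. We know that the $\mathcal{R}_\ell(S_i)$ are $\ell$-connected
by assumption, but we need to prove that their union is $\ell$-connected.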

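Proving that a union of complexes is connected is made easier by the
Mayer-Vietoris theorem, which says that $A \cup B$ is $c$-connected if $A$ and $B$ are $c$-connected
and $A \cap B$ is $(c - 1)$-connected. This suggests that we proceed by induction on $i$
to prove that

$$\mathcal{R}_\ell(S_1) \cup \mathcal{R}_\ell(S_2) \cup \cdots \cup \mathcal{R}_\ell(S_i)$$

is $\ell$-connected for $i = 1, \ldots, k$. We know that

$$\mathcal{R}_\ell(S_1) \cup \mathcal{R}_\ell(S_2) \cup \cdots \cup \mathcal{R}_\ell(S_{i-1}) \quad\text{and}\quad \mathcal{R}_\ell(S_i)$$

are both $\ell$-connected by hypothesis and assumption, so all we need to do is prove
that their intersection

$$\begin{aligned}
&[\mathcal{R}_\ell(S_1) \cup \mathcal{R}_\ell(S_2) \cup \cdots \cup \mathcal{R}_\ell(S_{i-1})] \cap \mathcal{R}_\ell(S_i) \\
&\quad = [\mathcal{R}_\ell(S_1) \cap \mathcal{R}_\ell(S_i)] \cup [\mathcal{R}_\ell(S_2) \cap \mathcal{R}_\ell(S_i)] \cup \cdots \cup [\mathcal{R}_\ell(S_{i-1}) \cap \mathcal{R}_\ell(S_i)]
\end{aligned}$$

is $(\ell - 1)$-connected. This union suggests another Mayer-Vietoris argument, but
what do we know about the connectivity of the $\mathcal{R}_\ell(S_j) \cap \mathcal{R}_\ell(S_i)$?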

One of the elegant properties of the round operator is that 

    $$\mathcal{R}_\ell(S_j) \cap \mathcal{R}_\ell(S_i) = \mathcal{R}_{\ell-c}(S_j \cap S_i)$$

where $c$ is the number of vertices in $S_i$ or $S_j$ that do not appear in $S_j \cap S_i$, 
whichever number is larger. We refer to this number as the codimension of $S_i$ 
and $S_j$, and it is a measure of how much the two states have in common. 
The $\mathcal{R}_{\ell-c}(S_j \cap S_i)$ are $(\ell-c)$-connected by our assumption, but we need to 
prove that they are $(\ell-1)$-connected for our inductive argument to go through, 
and it is not generally true that the $S_i$ and $S_j$ have codimension $c = 1$. 

One of the insights in this paper (and one of the reasons that the lower 
bound proof for set agreement is now so simple) is that we can organize the 
inductive argument so that we need only consider pairs of simplexes $S_i$ and $S_j$ in 
this union that have codimension $c = 1$. If we order the set $\mathcal{R}_\ell(S) = \{S_1, \ldots, S_k\}$ 
of one-round states correctly, then we can prove that every set $\mathcal{R}_\ell(S_j) \cap \mathcal{R}_\ell(S_i)$ 
in the union is contained in another set $\mathcal{R}_\ell(T_j) \cap \mathcal{R}_\ell(S_i)$ in the union such that $T_j$ 
and $S_i$ have codimension $c = 1$. The larger set “absorbs” the smaller set, and 
while the smaller set may not have the desired $(\ell-1)$-connectivity, the larger 
set does. Now we can write this union as the union of the absorbing sets, which 
is a union of $(\ell-1)$-connected sets, and apply Mayer-Vietoris to prove that the 
union itself is $(\ell-1)$-connected. In this paper, we show how to define a partial 
order on the set $\mathcal{R}_\ell(S) = \{S_1, \ldots, S_k\}$ of one-round states that guarantees this 
absorption property holds during the Mayer-Vietoris argument. We call this 
partial order an absorbing poset. 

To illustrate the notion of an absorbing poset, consider once again the complex 
$\mathcal{R}_1(S)$ of global states on the right of Figure Q. Suppose we order the 
pseudospheres making up $\mathcal{R}_1(S)$ by ordering the central triangle first and then 
ordering the cycles surrounding this triangle in some order. Within each cycle, 
let us order the edges of the cycle by ordering the edge of the central triangle 
first, then the two edges intersecting this edge in some order, and finally the 
outermost edge that does not intersect the central triangle. To see that this ordering 
has the properties of an absorbing poset, consider the central triangle $T$ and the 
edge $E$ consisting of the vertexes P;PR and R;PQR. The simplexes $T$ and $E$ 
intersect in the single vertex R;PQR and hence have codimension two. On the 
other hand, consider the edge $F$ consisting of the vertexes R;PQR and P;PQR. 
This edge $F$ appears between $T$ and $E$ in the simplex ordering, the intersection 
of $F$ and $E$ is actually equal to the intersection of $T$ and $E$, and the codimension 
of $F$ and $E$ is one. This property of an absorbing poset is key to the simplicity 
of the connectivity argument given in this paper. 

2.3 Related Work 

We are aware of three other proofs of the $k$-set agreement lower bound. 

Chaudhuri, Herlihy, Lynch, and Tuttle 0 gave the first proof. Their proof 
consisted of taking the standard similarity chain argument used to prove the 
consensus synchronous lower bound and running that argument in k dimensions 
at once to construct a subset of the reachable complex to which a standard 
topological tool called Sperner’s Lemma can be applied to obtain the desired 
impossibility. While their intuition is geometrically compelling, it required quite 
a bit of technical machinery to nail down the details. 



Herlihy, Rajsbaum, and Tuttle US! gave a proof closer to our “round-by- 
round” approach. In fact, the round operator that we define here is exactly the 
round operator they defined. Their connectivity proof for the reachable complex 
was not easy, however, and the inductive nature of the proof did not reflect the 
iterative nature of how the reachable complex is constructed by repeatedly ap- 
plying the round operator locally to a global state S. The notion of an absorbing 
poset used in this paper dramatically simplifies the connectivity proof. 

Gafni P2] gave another proof in an entirely different style. His proof is based 
on simple reductions between models, showing that the asynchronous model can 
simulate the first few rounds of the synchronous model, and thus showing that 
the synchronous lower bound follows from the known asynchronous impossibility 
result for set agreement |4lltil21 |. While his notion of reduction is elegant, his 
proof depends on the asynchronous impossibility result, and that result is not 
easy to prove. We are interested in a simple, self-contained proof that gives as 
much insight as possible into the topological behavior of the synchronous model 
of computation. 

Round-by-round proofs that show how the 1-dimensional (graph) connec- 
tivity evolves in the synchronous model have been described by Aguilera and 
Toueg P and Moses and Rajsbaum pS] (the latter do it in a more general way 
that applies to various other asynchronous models as well) to prove consensus 
impossibility results. These show how to do an elegant FLP style of argument, 
as opposed to the more involved backward inductive argument of the standard 
proofs. They present a (graph) connectivity proof of the successors of a 
global state. Thus, our proofs are similar to this strategy in the particular case 
of $k = 1$, but give additional insights because they show more general ways of 
organizing these connectivity arguments. 



There are also various set agreement impossibility results for asynchronous 
systems that are related to our work. 

Attiya and Rajsbaum |2| and Borowsky and Gafni 2| present two similar 
proofs for the set agreement impossibility. The relation to our work is that they 
are also combinatorial. However, their proofs are for an asynchronous, shared 
memory model. Also, they do not have a round-by-round structure; instead they 
work by proving that the set of global states at the end of the computation 
has some properties (somewhat weaker than connectivity) that are sufficient to 
apply Sperner’s Lemma and obtain the desired impossibility result. 

Borowsky and Gafni |S! defined an asynchronous shared-memory model 
where variables can be used only once, a model they showed to be equivalent 
to general asynchronous shared-memory models. They defined a round operator 
as we do, and they showed that one advantage of their model was that it had 
a very regular iterative structure that greatly simplified computing its 
connectivity. Unfortunately, their elegant techniques for the asynchronous model do 
not extend to the synchronous model. Reasoning about connectivity is harder 
in the synchronous model than the asynchronous model for two reasons. First, 
the connectivity never decreases in the asynchronous model, whereas it does in 
the synchronous model, so their techniques cannot extend to our model. Second, 
processors never actually fail in their construction since dead processors can be 
modeled as slow processors, but this is not an option in our model, and we are 
forced to admit simplexes of many dimensions as models of global states where 
processors have failed. 

3 Topology 

We now give formal definitions of the topological ideas sketched in the introduc- 
tion. 

A simplex is just a set of vertexes. Each vertex $v$ is labeled with a processor 
id $id(v)$ and a value $val(v)$. We assume that the vertexes of a simplex are labeled 
with distinct processor ids, and we assume a total ordering $<_{id}$ on processor ids, 
which induces an ordering on the vertexes of a simplex. A face of a simplex 
is a subset of the simplex’s vertexes, and we write $F \subseteq S$ if $F$ is a face of $S$. 
A simplex $X$ is between two simplexes $S_0$ and $S_1$ if $S_0 \subseteq X \subseteq S_1$. A complex 
is a set of simplexes closed under containment (which means that if a simplex 
belongs to a complex, then so do its faces). If $\mathcal{A}$ is a set of simplexes, denote 
by $\|\mathcal{A}\|$ the smallest simplicial complex containing every simplex of $\mathcal{A}$. It is easy 
to show that 

    $$\|\mathcal{A} \cup \mathcal{B}\| = \|\mathcal{A}\| \cup \|\mathcal{B}\| \quad\text{and}\quad \|\mathcal{A} \cap \mathcal{B}\| \subseteq \|\mathcal{A}\| \cap \|\mathcal{B}\| .$$
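As an informal illustration (not part of the original development), the induced complex can be computed by brute force for small examples. The sketch below assumes that a simplex is represented as a Python frozenset of vertex names and that the empty simplex counts as a face of every simplex; the name closure is hypothetical.

```python
from itertools import combinations

def closure(simplexes):
    """Return the smallest complex containing the given simplexes.

    Assumption: a simplex is a frozenset of sortable vertex names, and the
    empty simplex is a face of every simplex.
    """
    faces = set()
    for simplex in simplexes:
        vertexes = sorted(simplex)
        for size in range(len(vertexes) + 1):
            for face in combinations(vertexes, size):
                faces.add(frozenset(face))
    return faces

# Two different edges that share the vertex Q.
A = {frozenset("PQ")}
B = {frozenset("QR")}

# The closure of a union is the union of the closures.
union_ok = closure(A | B) == (closure(A) | closure(B))

# A and B have no simplex in common, yet their closures share the vertex Q,
# so the second containment can be proper.
proper = closure(A & B) < (closure(A) & closure(B))
```

The example shows why the second relation is only a containment: the sets A and B are disjoint as sets of simplexes, but the complexes they induce still intersect.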

The codimension of a set $\mathcal{S} = \{S_1, \ldots, S_m\}$ of simplexes is 

    $$\mathrm{codim}(\mathcal{S}) = \max_i \{\dim(S_i) - \dim(\cap_j S_j)\} = \max_i \{|S_i - \cap_j S_j|\} ,$$

where $\dim(\emptyset) = -1$ is the dimension of the empty simplex. This definition 
satisfies several simple properties, such as: 

1. If $X$ is between two simplexes $S$ and $T$, then 

    $$\mathrm{codim}(S, T) = \mathrm{codim}(S, X) + \mathrm{codim}(X, T).$$

2. If $\mathrm{codim}(S_0, S_i) \le 1$ for $i = 1, \ldots, m$, then 

    $$\mathrm{codim}(S_0, S_1, \ldots, S_m) \le m.$$

3. If $S_1, \ldots, S_m$ is a set of simplexes with largest dimension $N$ and 
codimension $c$, then their intersection $S_1 \cap \cdots \cap S_m$ is a simplex with dimension $N - c$. 
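The definition and its three properties can be checked on small examples. The following sketch is only an illustration under the assumption that a simplex is given as a set (or string) of single-character vertex names; the function names dim and codim mirror the text, but the code itself is not from the paper.

```python
def dim(simplex):
    """Dimension of a simplex: number of vertexes minus one (empty simplex: -1)."""
    return len(simplex) - 1

def codim(*simplexes):
    """Codimension of a collection of simplexes, following the definition above."""
    sets = [frozenset(s) for s in simplexes]
    common = frozenset.intersection(*sets)
    return max(len(s - common) for s in sets)

# Property 1: X = PQR lies between S = PQ and T = PQRU.
additive = codim("PQ", "PQRU") == codim("PQ", "PQR") + codim("PQR", "PQRU")

# Property 2: S0 = PQR has codimension one with each of PQT and QRU (m = 2).
bounded = codim("PQR", "PQT", "QRU") <= 2

# Property 3: largest dimension N = 2, codimension c = 1, intersection QR.
common = frozenset("PQR") & frozenset("QRT") & frozenset("QR")
intersection_dim = dim(common) == 2 - codim("PQR", "QRT", "QR")
```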

The connectivity of a complex is a direct generalization of ordinary graph 
connectivity. A complex is 0-connected if it is connected in the graph-theoretic 
sense, and while the definition of $k$-connectivity is more involved, the precise 
definition does not matter here since our work depends only on two fundamental 
properties of connectivity: 

Theorem 1. 

1. If $S$ is a simplex of dimension $k$, then the induced complex $\|S\|$ is 
$(k-1)$-connected. 

2. If the complexes $\mathcal{L}$ and $\mathcal{M}$ are $k$-connected and $\mathcal{L} \cap \mathcal{M}$ is $(k-1)$-connected, 
then $\mathcal{L} \cup \mathcal{M}$ is $k$-connected. 

By convention, a nonempty complex is $(-1)$-connected, and every complex is 
$(-k)$-connected for $k \ge 2$. The first property follows from the well-known fact 
that $\|S\|$ is $k$-connected when $S$ is a nonempty simplex of dimension $k$ ($k \ge 0$), 
but when $S$ is empty ($k = -1$), the best we can say is that $\|S\|$ is $(-2)$-connected 
(since every complex is $(-2)$-connected). The second property above is a 
consequence of the well-known Mayer-Vietoris sequence which relates the topology 
of $\mathcal{L} \cup \mathcal{M}$ with that of $\mathcal{L}$, $\mathcal{M}$, and $\mathcal{L} \cap \mathcal{M}$ (for example, see Theorem 33.1 in m). 

4 Computing Connectivity 

Computing the connectivity of $\mathcal{R}_\ell(\mathcal{A})$ for some complex $\mathcal{A}$ depends on properties 
of the round operator $\mathcal{R}_\ell$ and on how the Mayer-Vietoris argument used in the 
proof is organized. In this section, we define the notion of an $f$-operator and 
the notion of an absorbing poset that structures the Mayer-Vietoris argument 
by imposing a partial order on simplexes in the complex $\mathcal{A}$, and we prove that 
an $f$-operator applied to this partially-ordered complex $\mathcal{A}$ is connected. 

A simplicial operator $\mathcal{Q}$ is a function with an associated domain. It maps 
every simplex $S$ in its domain to a set $\mathcal{Q}(S)$ of simplexes, and it extends to sets 
of simplexes in its domain in the obvious way with $\mathcal{Q}(\mathcal{A}) = \bigcup_{S \in \mathcal{A}} \mathcal{Q}(S)$. Proving 
the connectivity of $\mathcal{Q}(\mathcal{A})$ is simplified if $\mathcal{Q}$ satisfies the following property: 

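Definition 1. Let $\mathcal{Q}$ be an operator, and let $f$ be a function that maps each 
set $\mathcal{A}$ of simplexes in the domain of $\mathcal{Q}$ to an integer $f(\mathcal{A})$. We say that $\mathcal{Q}$ is 
an $f$-operator if for every set $\mathcal{A}$ of simplexes in the domain of $\mathcal{Q}$ 

    $$\bigcap_{S \in \mathcal{A}} \|\mathcal{Q}(S)\| \text{ is } (f(\mathcal{A}) - c - 1)\text{-connected}$$

where $c = \mathrm{codim}(\mathcal{A})$. 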

To illustrate this definition, consider a single simplex $S$ of dimension $k$, and 
remember the fundamental fact of topology that $\|S\|$ is $(k-1)$-connected. Now 
consider two simplexes $S$ and $T$ of dimension $k$ that differ in exactly one vertex 
and hence have codimension one. Their intersection $S \cap T$ has dimension $k - 1$, 
so $\|S \cap T\|$ is $(k-1-1)$-connected. In fact, we can show that $\|S\| \cap \|T\|$ is 
$(k-1-1)$-connected and, in general, that $\|S\| \cap \|T\|$ is $(k-c-1)$-connected if $c$ is the 
codimension of $S$ and $T$. In other words, the connectivity of their intersection 
is reduced by their codimension. In the definition above, if we interpret $f(\mathcal{A})$ as 
the maximum connectivity of the complexes $\|\mathcal{Q}(S)\|$ taken over all simplexes $S$ 
in $\mathcal{A}$, then this definition says that taking the intersection of the $\|\mathcal{Q}(S)\|$ reduces 
the connectivity by the codimension of the $S$. As a simple corollary, if we take 
the identity operator $\mathcal{I}(S) = \{S\}$ and define $f(\mathcal{A}) = \max_{S \in \mathcal{A}} \dim(S)$ to be the 
maximum dimension of any simplex in $\mathcal{A}$, then we can prove that the identity 
operator is an $f$-operator. 

Proving the connectivity of $\mathcal{Q}(\mathcal{A})$ is simplified if there is a partial order on 
the simplexes in $\mathcal{A}$ that satisfies the following absorption property: 

Definition 2. Given a simplicial operator $\mathcal{Q}$ and a nonempty partially-ordered 
set $(\mathcal{S}, \preceq)$ of simplexes in the domain of $\mathcal{Q}$, we say that $(\mathcal{S}, \preceq)$ is an absorbing 
poset for $\mathcal{Q}$ if for every two simplexes $S$ and $T$ in $\mathcal{S}$ with $T \not\preceq S$ there is $T_S \in \mathcal{S}$ 
with $T_S \prec T$ such that 

    $$\|\mathcal{Q}(S)\| \cap \|\mathcal{Q}(T)\| \subseteq \|\mathcal{Q}(T_S)\| \cap \|\mathcal{Q}(T)\| \qquad (1)$$

    $$\mathrm{codim}(T_S, T) = 1. \qquad (2)$$

For example, if $\mathcal{S}$ is totally ordered, then every intersection $\|\mathcal{Q}(S)\| \cap \|\mathcal{Q}(T)\|$ 
involving a simplex $S$ preceding a simplex $T$ is contained in another intersection 
$\|\mathcal{Q}(T_S)\| \cap \|\mathcal{Q}(T)\|$ involving another simplex $T_S$ preceding $T$ with the additional 
property that $T_S$ and $T$ have codimension 1. 
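To make the totally ordered case concrete, the sketch below checks the absorption property for the identity operator on a list of simplexes, where the list order plays the role of the total order. It relies on the observation that, for the identity operator, $\|S\| \cap \|T\|$ is the set of common faces of $S$ and $T$, so condition (1) reduces to $S \cap T \subseteq T_S$. The function name is_absorbing_chain and the example edges are assumptions made for illustration.

```python
def codim(a, b):
    """Codimension of two simplexes given as frozensets of vertexes."""
    common = a & b
    return max(len(a - common), len(b - common))

def is_absorbing_chain(simplexes):
    """Check Definition 2 for the identity operator on a totally ordered list.

    The list order is the total order, so every earlier simplex s precedes t.
    """
    for index, t in enumerate(simplexes):
        earlier = simplexes[:index]
        for s in earlier:
            # Look for an earlier simplex that absorbs s & t and differs
            # from t in exactly one vertex.
            if not any((s & t) <= ts and codim(ts, t) == 1 for ts in earlier):
                return False
    return True

ab, bc, cd = frozenset("ab"), frozenset("bc"), frozenset("cd")

# Edges listed along a path: each new edge touches the previous one.
path_order = [ab, bc, cd]

# Same edges, but cd is listed before anything adjacent to it.
bad_order = [ab, cd, bc]

path_ok = is_absorbing_chain(path_order)
bad_ok = is_absorbing_chain(bad_order)
```

In the second ordering the pair (ab, cd) has codimension two and nothing earlier can absorb it, which matches the intuition that the union of ab and cd alone is disconnected.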

To see why such an ordering is useful, consider the round operator $\mathcal{R}_\ell$, and 
remember the problem we faced in the overview of proving that 

    $$\mathcal{R}_\ell(S_1) \cup \mathcal{R}_\ell(S_2) \cup \cdots \cup \mathcal{R}_\ell(S_i)$$

is connected for $i = 1, \ldots, k$. We were concerned that $\mathcal{R}_\ell(S_j) \cap \mathcal{R}_\ell(S_i) = 
\mathcal{R}_{\ell-c}(S_j \cap S_i)$ was $(\ell-c)$-connected and in general might not be $(\ell-1)$-connected 
since the codimension $c$ of $S_j$ and $S_i$ might be too high. If we can impose an 
ordering on the $S_1, \ldots, S_k$ and prove that the $S_1, \ldots, S_k$ form an absorbing 
poset for $\mathcal{R}_\ell$, then each $\mathcal{R}_\ell(S_j) \cap \mathcal{R}_\ell(S_i)$ with $j < i$ is contained in another 
$\mathcal{R}_\ell(S_{j'}) \cap \mathcal{R}_\ell(S_i)$ with $j' < i$ where $S_{j'}$ and $S_i$ have codimension one. This 
means that when computing the connectivity of the union we can restrict our 
attention to the intersections $\mathcal{R}_\ell(S_{j'}) \cap \mathcal{R}_\ell(S_i)$ with codimension one, and the 
proof goes through. 

In general, we can prove that applying an operator to an absorbing poset 
yields a connected complex: 

Theorem 2. If $\mathcal{Q}$ is an $f$-operator and $(\mathcal{A}, \preceq)$ is an absorbing poset for $\mathcal{Q}$, then 

    $$\|\mathcal{Q}(\mathcal{A})\| = \bigcup_{S \in \mathcal{A}} \|\mathcal{Q}(S)\| \text{ is } (f - 1)\text{-connected}$$

where $f = \min_{\mathcal{B} \subseteq \mathcal{A}} f(\mathcal{B})$. 

In the special case of the identity operator, we say that $(\mathcal{A}, \preceq)$ is an absorbing 
poset if it is an absorbing poset for the identity operator. It is an easy corollary 
to show that if $(\mathcal{A}, \preceq)$ is an absorbing poset, then 

    $$\|\mathcal{A}\| = \bigcup_{S \in \mathcal{A}} \|S\| \text{ is } (N - 1)\text{-connected,}$$

where $N$ is the minimum dimension of the simplexes in $\mathcal{A}$. 

5 Synchronous Connectivity 

In this section, we show how to use the ideas of the previous section to prove 
that $\mathcal{R}_k^r(S)$ is $(k-1)$-connected, from which we conclude that $k$-set agreement 
is impossible to solve in $r$ rounds. 



5.1 Round Operators 

Given a simplex $S$ representing the state at the beginning of a round, and given 
a set $X$ of processors that fail during the round, let $F = S/X$ be the face 
of $S$ obtained from $S$ by deleting the vertexes labeled with processors in $X$. 
The set of all possible states at the end of a round of computation from $S$ in 
which processors in $X$ fail can be represented by the set of all possible simplexes 
obtained by labeling the vertexes of $F$ with simplexes between $F$ and $S$. The set 
of simplexes obtained in this way is what we call a pseudosphere. For 
every simplex $S$, the pseudosphere operator $\mathcal{P}_S(F)$ maps a face $F$ of $S$ to the set 
of all labelings of $F$ with simplexes between $F$ and $S$. The set $\mathcal{P}_S(F)$ is called 
a pseudosphere, and the face $F$ is called the base simplex of the pseudosphere. 
Given a simplex $T$ contained in a pseudosphere $\mathcal{P}_S(F)$ we define $base(T)$ to be 
the base simplex $F$ of the pseudosphere. 
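For small simplexes a pseudosphere can be enumerated directly. The sketch below is an illustration under stated assumptions: a vertex is identified with its processor name, a simplex of the pseudosphere is a frozenset of (processor, label) pairs, and the names between and pseudosphere are hypothetical.

```python
from itertools import combinations, product

def between(face, simplex):
    """All simplexes X with face <= X <= simplex."""
    extra = sorted(simplex - face)
    return [face | frozenset(added)
            for count in range(len(extra) + 1)
            for added in combinations(extra, count)]

def pseudosphere(simplex, face):
    """All labelings of the vertexes of `face` with simplexes between face and simplex."""
    simplex, face = frozenset(simplex), frozenset(face)
    base = sorted(face)
    labels = between(face, simplex)
    return {frozenset(zip(base, choice))
            for choice in product(labels, repeat=len(base))}

S = frozenset("PQR")

# No failures: every processor hears from everyone, giving one triangle.
central = pseudosphere(S, S)

# P fails: Q and R each either hear from P or not, giving four edges.
cycle = pseudosphere(S, "QR")

# The edge where both Q and R heard from P is a face of the central triangle.
shared_edge = frozenset({("Q", S), ("R", S)})
```

With three processors this reproduces the picture described in the overview: a central triangle, and for each possibly failed processor a cycle of four edges, one of which lies on the triangle.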

If $\ell$ processors fail during the round, there are many ways to choose this 
set $X$ of processors that fail, and hence many ways to choose the base simplexes 
$F = S/X$ for the pseudospheres whose simplexes represent the states at the 
end of the round. For every integer $\ell \ge 0$, the $\ell$-failure operator $\mathcal{F}_\ell(S)$ maps a 
simplex $S$ to the set of all faces $F$ of $S$ with $\mathrm{codim}(F, S) \le \ell$, which is the set 
of all faces obtained by deleting at most $\ell$ vertexes from $S$. The domain of the 
operator $\mathcal{F}_\ell(S)$ is the set of all simplexes $S$ with $\dim(S) \ge \ell$. 

Finally, for every integer $\ell \ge 0$, the synchronous round operator $\mathcal{R}_\ell(S)$ is 
defined by 

    $$\mathcal{R}_\ell(S) = \mathcal{P}_S(\mathcal{F}_\ell(S)).$$

The domain of this operator $\mathcal{R}_\ell(S)$ is the set of all simplexes $S$ with 
$\dim(S) \ge \ell + k$. This round operator satisfies a number of basic properties such as: 

Lemma 1. 

1. $\mathcal{R}_\ell(S) \subseteq \mathcal{R}_m(S)$ if $\ell \le m$ and $S$ is in the domain of $\mathcal{R}_m$. 

2. $\mathcal{R}_\ell(S) \subseteq \mathcal{R}_{\ell+c}(T)$ if $S \subseteq T$ and $c = \mathrm{codim}(S, T)$. 

3. $\mathcal{R}_\ell(S_1) \cap \cdots \cap \mathcal{R}_\ell(S_m) = \mathcal{R}_{\ell-c}(S_1 \cap \cdots \cap S_m)$ if $c = \mathrm{codim}(S_1, \ldots, S_m)$, 
$\ell \ge c$, and each $S_i$ is in the domain of $\mathcal{R}_\ell$. 

4. $\|\mathcal{R}_\ell(S_1)\| \cap \cdots \cap \|\mathcal{R}_\ell(S_m)\| = \|\mathcal{R}_\ell(S_1) \cap \cdots \cap \mathcal{R}_\ell(S_m)\|$. 

Proof. We sketch the proof of property 3. 

For the $\supseteq$ containment, suppose $A \in \mathcal{R}_{\ell-c}(\cap_j S_j)$. This means that $A$ is a 
labeling of a simplex $F$ with simplexes between $F$ and $\cap_j S_j$ for some face $F$ 
of $\cap_j S_j$ satisfying $\mathrm{codim}(F, \cap_j S_j) \le \ell - c$. Since $A$ is a labeling of $F$ with 
simplexes between $F$ and $\cap_j S_j$, it is obviously a labeling of $F$ with simplexes 
between $F$ and $S_i$. We have $A \in \mathcal{R}_\ell(S_i)$ since $F$ is a face of $\cap_j S_j$ which is in 
turn a face of $S_i$, and hence 

    $$\mathrm{codim}(F, S_i) = \mathrm{codim}(F, \cap_j S_j) + \mathrm{codim}(\cap_j S_j, S_i)$$
    $$\le \mathrm{codim}(F, \cap_j S_j) + \mathrm{codim}(S_1, \ldots, S_m)$$
    $$\le (\ell - c) + c = \ell.$$

For the $\subseteq$ containment, suppose $A \in \cap_j \mathcal{R}_\ell(S_j)$. For each $i$, we know that $A$ 
is a labeling of $F_i$ with simplexes between $F_i$ and $S_i$ for some face $F_i$ of $S_i$ 
satisfying $\mathrm{codim}(F_i, S_i) \le \ell$. Since $A$ is a labeling of $F_i$ for each $i$, it must be 
that the $F_i$ are all equal, so let $F$ be this common face of the $S_i$ and hence 
of $\cap_j S_j$. Since $A$ is a labeling of $F$ with simplexes between $F$ and $S_i$ for each $i$, 
it must be that $A$ is a labeling of $F$ with simplexes between $F$ and $\cap_j S_j$. Since $F$ 
is a face of $\cap_j S_j$ which is in turn a face of each $S_i$, including any $S_m$ satisfying 
$\mathrm{codim}(\cap_j S_j, S_m) = \mathrm{codim}(S_1, \ldots, S_m) = c$, we have $A \in \mathcal{R}_{\ell-c}(\cap_j S_j)$ since 
$\mathrm{codim}(F, \cap_j S_j) = \mathrm{codim}(F, S_m) - \mathrm{codim}(\cap_j S_j, S_m) \le \ell - c$. □ 
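Property 3 can also be checked by brute force on a small instance. The sketch below is a hypothetical illustration, not the authors' code: it assumes vertexes are plain names, represents a one-round state as a frozenset of (vertex, label) pairs, and ignores the domain restriction involving $k$. The names failure_faces, pseudosphere and round_operator are assumptions.

```python
from itertools import combinations, product

def failure_faces(simplex, l):
    """Faces of `simplex` obtained by deleting at most l vertexes."""
    vertexes = sorted(simplex)
    return [simplex - frozenset(gone)
            for count in range(l + 1)
            for gone in combinations(vertexes, count)]

def pseudosphere(simplex, face):
    """All labelings of the vertexes of `face` with simplexes between face and simplex."""
    extra = sorted(simplex - face)
    labels = [face | frozenset(added)
              for count in range(len(extra) + 1)
              for added in combinations(extra, count)]
    base = sorted(face)
    return {frozenset(zip(base, choice))
            for choice in product(labels, repeat=len(base))}

def round_operator(simplex, l):
    """One-round states: union of the pseudospheres over all failure faces."""
    simplex = frozenset(simplex)
    states = set()
    for face in failure_faces(simplex, l):
        states |= pseudosphere(simplex, face)
    return states

# Two simplexes of codimension c = 1 and a failure bound l = 2.
S1 = frozenset("abcd")
S2 = frozenset("abce")
lhs = round_operator(S1, 2) & round_operator(S2, 2)
rhs = round_operator(S1 & S2, 2 - 1)
```

Here the common states are exactly the one-round states of the shared face abc with one fewer failure allowed, as the lemma predicts for this instance.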

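The round operator $\mathcal{R}_\ell$ models a single round of computation. We model 
multiple rounds of computation with the multi-round operator $\mathcal{R}_L^r\mathcal{R}_\ell$ defined 
inductively by $\mathcal{R}_L^0\mathcal{R}_\ell(S) = \mathcal{R}_\ell(S)$ and $\mathcal{R}_L^r\mathcal{R}_\ell(S) = \mathcal{R}_L(\mathcal{R}_L^{r-1}\mathcal{R}_\ell(S))$ for $r > 0$. 
The domain of $\mathcal{R}_L^r\mathcal{R}_\ell$ is the set of all simplexes $S$ with $\dim(S) \ge rL + \ell + k$. 
The properties of one-round operators given above generalize to multi-round 
operators where $\mathcal{R}_\ell$ is replaced by $\mathcal{R}_L^r\mathcal{R}_\ell$. 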

5.2 Absorbing Posets 

We now impose a partial order on $\mathcal{R}_\ell(S)$ and prove that it is an absorbing poset. 
First we order the pseudospheres $\mathcal{P}_S(F)$ making up $\mathcal{R}_\ell(S)$, and then we order 
the simplexes within each pseudosphere $\mathcal{P}_S(F)$. 

Both of these orders depend on ordering the faces $F$ of $S$, which we do 
lexicographically. First we order the faces $F$ by decreasing dimension, so that 
large faces occur before small faces. Then we order faces of the same dimension 
with a rather arbitrary rule based on our total order on processor ids: we order $F_0$ 
before $F_1$ if the smallest processor id labeling vertexes in $F_0$ and not $F_1$ comes 
before the smallest processor id labeling $F_1$ and not $F_0$. Formally: 

Definition 3. Define the total order $\preceq_f$ on the faces of a simplex $S$ by $F_0 \preceq_f F_1$ 
iff 

1. $\dim(F_0) > \dim(F_1)$ or 

2. $\dim(F_0) = \dim(F_1)$ and either 

   a) $F_0 = F_1$ or 

   b) $F_0 \ne F_1$ and $p_0 <_{id} p_1$ 

where $p_0 = \min\{ids(F_0) - ids(F_1)\}$ and $p_1 = \min\{ids(F_1) - ids(F_0)\}$. 
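The face order translates directly into a comparison function. The sketch below assumes that a face is a frozenset of processor ids and that ids are integers compared with the usual order; the names face_before and sorted_faces are hypothetical, and only nonempty faces are listed.

```python
from functools import cmp_to_key
from itertools import combinations

def face_before(f0, f1):
    """True if f0 precedes or equals f1 in the face order of Definition 3."""
    if len(f0) != len(f1):
        return len(f0) > len(f1)          # larger dimension comes first
    if f0 == f1:
        return True
    # Same size and distinct, so both differences are nonempty.
    return min(f0 - f1) < min(f1 - f0)

def compare_faces(f0, f1):
    """Three-way comparison built from face_before, for use with sorted()."""
    if f0 == f1:
        return 0
    return -1 if face_before(f0, f1) else 1

def sorted_faces(simplex):
    """All nonempty faces of `simplex`, listed in the face order."""
    ids = sorted(simplex)
    faces = [frozenset(c)
             for size in range(1, len(ids) + 1)
             for c in combinations(ids, size)]
    return sorted(faces, key=cmp_to_key(compare_faces))

order = [sorted(face) for face in sorted_faces({1, 2, 3})]
```

For faces of equal size, rule 2b coincides with comparing the sorted id lists lexicographically, which is why the three edges appear as 12, 13, 23.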

This face ordering induces an ordering on pseudospheres: $\mathcal{P}_S(F_0)$ comes 
before $\mathcal{P}_S(F_1)$ if $F_0$ comes before $F_1$ in the face ordering. This face ordering also 
induces an ordering on the simplexes within a single pseudosphere $\mathcal{P}_S(F)$: $S_0$ 
comes before $S_1$ if for each vertex $v$ of the base simplex $F$ the face labeling $v$ 
in $S_0$ comes before the face labeling $v$ in $S_1$. This ordering of $\mathcal{P}_S(F)$ is defined 
formally as follows: 

Definition 4. Define the partial order $\preceq_p$ on $\mathcal{P}_S(F)$ by $S_0 \preceq_p S_1$ iff 
$S_{0,v} \preceq_f S_{1,v}$ for each vertex $v$ in $F$, where $S_{0,v}$ and $S_{1,v}$ are the simplexes 
labeling the vertex $v$ in $S_0$ and $S_1$. 

Pulling everything together, the partial order on $\mathcal{R}_\ell(S)$ is defined as follows: 

Definition 5. Define the partial order $\preceq_r$ on $\mathcal{R}_\ell(S)$ by $S_0 \preceq_r S_1$ iff 

1. different pseudospheres: $base(S_0) \prec_f base(S_1)$ or 

2. same pseudosphere: $base(S_0) = base(S_1)$ and $S_0 \preceq_p S_1$. 

Now we can prove that $(\mathcal{R}_\ell(S), \preceq_r)$ is an absorbing poset, and more generally that: 

Lemma 2. $(\mathcal{R}_\ell(S), \preceq_r)$ is an absorbing poset for $\mathcal{R}_L^r$ for $\dim(S) \ge rL + \ell + k$. 

Proof. We prove only the base case that $(\mathcal{R}_\ell(S), \preceq_r)$ is an absorbing poset. Let $A$ 
and $B$ be two simplexes in $\mathcal{R}_\ell(S)$ satisfying $B \not\preceq_r A$. 

Case 1: Suppose $A$ and $B$ are in the same pseudosphere $\mathcal{P}_S(F)$ for some 
face $F$ of $S$. The simplexes $A$ and $B$ are labelings of $F$ with simplexes between $F$ 
and $S$, so let $A_v$ and $B_v$ denote the label of vertex $v$ in $A$ and $B$ for every vertex 
$v \in F$. There must be some vertex $v$ with $A_v \prec_f B_v$ since $B \not\preceq_r A$. Let $B_A$ be $B$ 
with the label of $v$ changed from $B_v$ to $A_v$. We have $B_A \prec_r B$ since the label of $v$ 
in $B_A$ is ordered before the label of $v$ in $B$, and the labels of all other vertices 
are equal. We have $A \cap B \subseteq B_A$ since $v$ is not in $A \cap B$ due to the conflicting 
labels for $v$, while all other vertexes of $B$ and hence of $A \cap B$ are in $B_A$. Finally, 
we have $\mathrm{codim}(B_A, B) = 1$ since $B_A$ and $B$ differ only in the label of $v$. 

Case 2: Suppose $A$ and $B$ are in different pseudospheres $\mathcal{P}_S(F_A)$ and $\mathcal{P}_S(F_B)$ 
for distinct faces $F_A$ and $F_B$ of $S$. We can assume without loss of generality that 
every vertex of $B - A$ is labeled with $S$, and we can show that $F_A \prec_f F_B$. 

Case 2a: Suppose $\dim(F_A) > \dim(F_B)$. Since $\dim(F_A) > \dim(F_B)$, the set 
$F_A - F_B$ must be nonempty, so choose any vertex $v \in F_A - F_B$. Since $A \in \mathcal{P}_S(F_A)$, 
the simplex $A$ must be a labeling of $F_A$ with simplexes between $F_A$ and $S$. 
Since $v$ is a vertex of $F_A$, this means that $v$ appears in all simplexes labeling $A$, 
and hence in all simplexes labeling $A \cap B$. Since we have assumed that $S$ is 
the label of every vertex in $B - A$, and since $S$ certainly contains the vertex $v$, 
the vertex $v$ appears in all labels of $B - A$. It follows that $v$ appears in every 
simplex labeling $B$. Let $B_A$ be the simplex consisting of $B$ together with the 
vertex $v$ labeled with $S$, and notice that $B_A$ is a simplex in $\mathcal{R}_\ell(S)$. We have 
$B_A \prec_r B$ since $\dim(B_A) = \dim(B) + 1 > \dim(B)$. We have $A \cap B \subseteq B_A$ since 
$A \cap B \subseteq B \subseteq B_A$. We have $\mathrm{codim}(B, B_A) = 1$ since $B$ and $B_A$ differ only in $v$. 

Case 2b: Suppose $\dim(F_A) = \dim(F_B)$, in which case we have $p_A <_{id} p_B$ 
where $p_A = \min\{ids(F_A) - ids(F_B)\}$ and $p_B = \min\{ids(F_B) - ids(F_A)\}$. 
Let $v_A$ and $v_B$ be the vertexes for processors $p_A$ and $p_B$ in the faces $F_A$ and $F_B$ 
of $S$. Let $F_C$ be the face of $S$ obtained from $F_B$ by replacing $v_B$ with $v_A$, and 
let $C$ be the labeling of $F_C$ obtained by labeling $v_A$ with $S$ and every other 
vertex with its label in $B$. Since $A$ is a labeling of $F_A$ with simplexes between $F_A$ 
and $S$, and since $v_A$ is a vertex of $F_A$, the vertex $v_A$ appears in every simplex 
labeling $A$ and hence $A \cap B$; and since we are assuming that every vertex of 
$B - A$ is labeled with $S$ which certainly contains $v_A$, it follows that every vertex 
of $B - A$ is labeled with a simplex containing $v_A$; and hence it follows that every 
label in $B$ contains $F_C$. It follows that $C \in \mathcal{R}_\ell(S)$ since $C$ is a labeling of a 
face $F_C$ of $S$ with simplexes between $F_C$ and $S$. We have $C \prec_r B$ since 

    $$\min\{ids(F_C) - ids(F_B)\} = p_A <_{id} p_B = \min\{ids(F_B) - ids(F_C)\}.$$

We have $A \cap B \subseteq C$ and $\mathrm{codim}(B, C) = 1$. Taking $B_A = C$, we are done. □ 
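As a sanity check of the base case (an illustration only, not a substitute for the proof), the absorption property can be verified exhaustively for three processors and at most one failure. The sketch assumes integer processor ids, represents a state as a frozenset of (id, label) pairs, and uses the fact that for the identity operator condition (1) reduces to $A \cap B \subseteq B_A$; all function names are hypothetical.

```python
from itertools import combinations, product

def face_lt(f0, f1):
    """Strict face order: larger faces first, then smallest differing id."""
    if f0 == f1:
        return False
    if len(f0) != len(f1):
        return len(f0) > len(f1)
    return min(f0 - f1) < min(f1 - f0)

def round_states(simplex, l):
    """One-round states: labelings of each face with at most l vertexes deleted."""
    simplex = frozenset(simplex)
    states = set()
    for count in range(l + 1):
        for gone in combinations(sorted(simplex), count):
            face = simplex - frozenset(gone)
            extra = sorted(simplex - face)
            labels = [face | frozenset(c)
                      for n in range(len(extra) + 1)
                      for c in combinations(extra, n)]
            base = sorted(face)
            for choice in product(labels, repeat=len(base)):
                states.add(frozenset(zip(base, choice)))
    return states

def base(state):
    return frozenset(v for v, _ in state)

def state_le(s0, s1):
    """Order on states: compare bases first, then labels vertex by vertex."""
    b0, b1 = base(s0), base(s1)
    if b0 != b1:
        return face_lt(b0, b1)
    l0, l1 = dict(s0), dict(s1)
    return all(l0[v] == l1[v] or face_lt(l0[v], l1[v]) for v in b0)

def codim(a, b):
    common = a & b
    return max(len(a - common), len(b - common))

def is_absorbing(states):
    """Check the absorption property for the identity operator."""
    for b in states:
        for a in states:
            if state_le(b, a):
                continue                      # only pairs with b not <= a matter
            if not any(w != b and state_le(w, b) and (a & b) <= w
                       and codim(w, b) == 1 for w in states):
                return False
    return True

triangle_states = round_states({1, 2, 3}, 1)
absorbing = is_absorbing(triangle_states)
```

The thirteen states are the central triangle plus three cycles of four edges, and every required pair finds a witness of codimension one, as the lemma states for this instance.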

5.3 Connectivity 

All that remains is to prove that $\mathcal{R}_L^r$ is a $k$-operator. This follows from the 
following pair of statements proven by mutual induction: 

Theorem 3. For all $r \ge 0$, 

1. $\|\mathcal{R}_L^r\mathcal{R}_\ell(S)\|$ is $(k-1)$-connected for all $\ell \ge 0$, all $L \ge k$, and all $S$ in the 
domain of $\mathcal{R}_L^r\mathcal{R}_\ell$. 

2. $\mathcal{R}_L^r\mathcal{R}_\ell$ is a $k$-operator for all $L, \ell \ge k$. 

Since $\mathcal{R}_L^r$ is a $k$-operator, and since $(\mathcal{R}_\ell(S), \preceq_r)$ is an absorbing poset for $\mathcal{R}_L^r$, 
the connectivity follows by Theorem 2. 

Corollary 1. $\|\mathcal{R}_k^r(S)\|$ is $(k-1)$-connected if $\dim(S) \ge (r+1)k$. 

6 Conclusion 

As we have said, the impossibility of $k$-set agreement now follows directly from 
the connectivity of $\|\mathcal{R}_k^r(S)\|$ using standard arguments based on variants of 
Sperner’s Lemma that have appeared in several places now. We hope that the 
notions of a round operator and an absorbing poset will yield simple proofs of 
other results, and will show the way toward simple proofs in other models of 
computation such as the asynchronous message-passing model. 

Abstract. Do-All is the problem of performing N tasks in a distributed 
system of P failure-prone processors |S]. Many distributed and parallel 
algorithms have been developed for this problem and several algorithm 
simulations have been developed by iterating Do-All algorithms. The 
efficiency of the solutions for Do-All is measured in terms of work com- 
plexity where all processing steps taken by the processors are counted. 
We present the first non-trivial lower bounds for Do-All that capture the 
dependence of work on N, P and f, the number of processor crashes. 
For the model of computation where processors are able to make perfect 
load-balancing decisions locally, we also present matching upper bounds. 
We define the r-iterative Do-All problem that abstracts the repeated 
use of Do-All such as found in algorithm simulations. Our f-sensitive 
analysis enables us to derive a tight bound for r-iterative Do-All work 
(that is stronger than the r-fold work complexity of a single Do-All). 
Our approach that models perfect load-balancing allows for the analysis 
of specific algorithms to be divided into two parts: (i) the analysis of the 
cost of tolerating failures while performing work, and (ii) the analysis of 
the cost of implementing load-balancing. We demonstrate the utility and 
generality of this approach by improving the analysis of two known effi- 
cient algorithms. Finally we present a new upper bound on simulations 
of synchronous shared-memory algorithms on crash-prone processors. 



1 Introduction 

Performing a set of tasks in a decentralized setting is a fundamental problem in 
distributed computing. This is often challenging because the set of processors 
available to the computation and their ability to communicate may dynamically 
change due to perturbations in the computation medium. An abstract statement 
of this problem, referred to as the Do-All problem (P fault-prone processors 
perform N independent tasks), is one of the standard problems in the 
research on the complexity of fault-tolerant distributed computation l^rrzj . This 
problem has been studied in shared-memory models (Write-All) mm6\ , in 
message-passing models |6I8I10| - and in partitionable networks {Omni-Do) 0 
mm. Solutions for Do-All must perform all tasks efficiently in the presence of 
specific failure patterns. The efficiency is assessed in terms of work, time and 
communication complexity depending on the specific model of computation. 

In the design of practical distributed/parallel programs one needs to ensure 
good performance and dependability under unpredictable load patterns caused 
by deviations from synchrony or by the failures of some processors to complete 
tasks on time. Here again, a common challenge is to perform N independent 
tasks on P processors m- Such tasks could be copying a large array, searching 
a collection of data, or applying a function to all elements of a matrix HH 

In this paper we focus on the work complexity of the Do-All problem in the 
presence of arbitrary failure patterns imposed by an adversary. The processors 
are synchronous and are assumed to be fail-stop m- The work complexity re- 
flects the total amount of processing steps expended by an algorithm m- A 
distinguishing feature of our results is that the complexity is expressed in terms 
of the number of processor crashes f in addition to P and N. 

Our approach is motivated in part by the analyses of the consensus 
problems. The venerable FLP impossibility result |0| and the algorithms that solve 
consensus in the models that allow fault-tolerant solutions teach the following: 
(i) asynchronous models are too weak for fault-tolerance 1221, and (ii) the 
maximum number of processor failures needs to be included in upper/lower bounds 
and impossibility results, e.g., to tolerate f failures in some models, consensus 
algorithms require P = 3f + 1 total processors |232H1. In this work we consider 
crash failures to ensure that solutions exist for as long as the number of failures 
f is smaller than the number of processors P, and we aim to express the work of 
the synchronous processors as a function of N, P and f. 

Until very recently, an unsatisfactory landscape existed with respect to the 
understanding of how the bounds on work depend on f, the number of failures. 
That is, work was typically given as a function of N and P, but it was either 
not elucidated how f impacts work, or, when f was a part of the equation, 
it was primarily due to the nature of a specific algorithm, and not due to the 
inherent properties of the Do-All problem. For example, the work of the best 
known synchronous shared-memory algorithm HZ! is given as a function of N 
and P. This is also the case with the best known asynchronous shared-memory 
algorithm |2| . Similarly, the best known shared-memory lower bound on work for 
Do-All is not parameterized in terms of f PJ. Likewise, the best known lower 
bound applicable to message-passing models does not involve f 0 . The work of 
message-passing algorithms, e.g. pIlDj , typically includes f, but this is due to the 
use of single coordinators, which means that for f coordinator failures the work 
necessarily includes a factor f · P. A message-passing algorithm using multiple 
coordinators 0 avoids this inefficiency and includes a factor that depends on 
log f (but as we show in this paper, that analysis involves f in a somewhat 
superficial way). Thus prior lower/upper bound results for Do-All do not teach 
adequately how the work complexity depends on the number of failures f. 

When considering synchronous shared-memory computing with failure-prone 
processors the impact of imprecise analysis of work complexity is especially 
significant. Some approaches use an iterative Do-All approach to execute 
synchronous parallel (PRAM) algorithms on failure-prone processors by simulating 
the parallel steps of ideal processors with the help of some chosen Do-All 
algorithm (see also related work below). It was shown that the execution of a 
single N-processor step on P failure-prone processors does not exceed the 
complexity of solving an N-size instance of Do-All using P failure-prone processors. 
Thus if W_{N,P} is the complexity of solving a Do-All instance of size N using P 
processors, and the parallel-time × processor product of the given N-processor 
algorithm is t · N, then the algorithm can be deterministically simulated with 
work O(t · W_{N,P}). If the analysis does not accurately reflect the impact of the 
number of failures f, then the resulting upper bound is needlessly inflated. 

Contributions. In this work we study the work complexity of deterministic 
Do-All in the presence of arbitrary dynamic patterns of stop-failures. Let N be 
the size of the Do-All problem, P the number of processors, and f the number of 
crashes (0 < f < P < N). We present the first complete analysis of Do-All work 
complexity under the perfect load balancing assumption by proving matching 
upper and lower bounds as functions of N, P and f. This is for the model of 
computation where the computation is fully abstracted away from the low-level 
shared-memory and message-passing issues, and where a worst-case omniscient 
dynamic adversary can cause up to f crashes. This also establishes the first 
non-trivial lower bound for Do-All for a moderate number of failures (f < P/log P). 
An important contribution of this work is the definition and analysis of the 
r-iterative Do-All problem that models the repetitive use of Do-All algorithms 
(such as found in algorithm simulations). 

We demonstrate the utility and generality of our results by showing new
bounds on work for fault-tolerant simulations of arbitrary PRAM algorithms on
crash-prone processors, and by improving the analyses of two known algorithms.
We derive a new and complete failure-sensitivity analysis of the best known
algorithm for the synchronous shared-memory model (algorithm W). We
also give an improved analysis of the work and message complexity for an efficient
message-passing algorithm (algorithm AN).

We give a detailed summary of the complexity results in Section 2.

Related work: algorithm simulations. Do-All algorithms can be used
iteratively to simulate parallel algorithms formulated for synchronous failure-free
processors in deterministic and probabilistic settings. This commonly requires
that (i) the individual processor steps are made idempotent (since
they may have to be performed multiple times due to failures or asynchrony),
and that (ii) auxiliary memory linear in the number of processors is made
available (to be used as a “scratchpad” and to store intermediate results). While
the former can be solved with the help of an automated tool, e.g., a compiler,
the latter requires sophisticated solutions because of the difficulty of (re)using
the auxiliary memory due to “late writers” (i.e., processors that are slow and
that unknowingly write stale values to memory). Randomized solutions
addressing these problems have been developed. Another important aspect of
algorithm simulations is the use of an optimistic approach, where the
computation may proceed for several steps assuming that all tasks assigned to active
processors are successfully completed. In some deterministic models
optimal simulations are possible; however, randomized solutions are
able to achieve optimality (whp) for broader ranges of models and algorithms.
A practical implementation has also been discussed in the literature.

The rest of the paper is structured as follows. In Section 2 we summarize the
results. In Section 3 we give models and definitions. In Section 4 we present the
bounds under the perfect load-balancing assumption. We give new upper bounds
for the message-passing model in Section 5 and for the shared-memory model
in Section 6. We conclude in Section 7.

Due to the page limit, most proofs are either omitted or given as sketches.
A full version of this paper is available.



2 Grand Tour of the Results 

We let Do-All(N, P, f) stand for the Do-All problem for N tasks, P processors
and up to f failures. We let Do-All$^{O}(N, P, f)$ denote the Do-All(N, P, f) problem
that is solved with the use of an omniscient oracle that assists the processors
(but unlike the oracle’s Delphian colleague, it cannot predict the future). The
oracle assumption is used as a tool for studying the work complexity patterns of
any fault-tolerant algorithm that implements perfect work-load balancing. This
allows for the complexity analysis of specific algorithms to be divided into two
parts: (i) the analysis of the cost of tolerating failures while performing work
assuming perfect load-balancing, and (ii) the analysis of the cost of implementing
perfect load-balancing. We use exactly this approach to derive new f-sensitive
upper bounds for message-passing and shared-memory models.

We have previously shown that Do-All$^{O}(N, P, f)$ can be solved with work
$O(N + P \log P/\log\log P)$ where f < P, and gave a matching lower bound in the specific case
where / = i^g i^g p + 0( (iogi^gP )2 )■ This meant that as long as the adversary 
can cause at least ipg i^g p failures, Do-All^{N, P, /) has matching upper and 
lower bounds of -I- P i^'g ■ We also showed that when / = o( ) then 

Do-AlP{N, P, /) can be solved with work 0{N PPgn,ax{2,^} P)- 

Thus prior to our newest results: (i) no non-trivial lower bounds were known 
for / < {ii) no /-sensitive analysis was available for the upper bounds 

when / is between and ipg i^g p > and therefore, (in) there existed a gap in 
upper/lower bounds analysis for the range 1 < / < ipg ]^g p , where / = w(l). 
Yet practical concerns would be well served by the knowledge of what happens 
in Do-All when the number of failures is moderate. In particular, it is important 
to understand the behavior of the best algorithms for the entire range of f.

The detailed contributions in this work are as follows. 
I. We provide upper bounds (Section 4.1) and matching lower bounds
(Section 4.2) that address all remaining gaps, hence we give a complete analysis
of Do-All$^{O}(N, P, f)$ for the entire range of f. The bounds on work W are:¹

(a) $W = \{O|\Omega\}\left(N + P \frac{\log P}{\log\log P}\right)$ when $f > c\,\frac{P}{\log P}$, for any $c > 0$,

(b) $W = \{O|\Omega\}\left(N + P \log_{P/f} P\right)$ when $f \le c\,\frac{P}{\log P}$, for any $c > 0$.   (1)

The lower bounds of course apply to algorithms in weaker models.

The quantity $Q_{P,f}$, defined below and extracted from the bounds (1)(a,b) above,
plays an important role in the analysis of complexity of several algorithms.

$$Q_{P,f} = \begin{cases} P \frac{\log P}{\log\log P} & \text{when } f > c\,\frac{P}{\log P}, \text{ any } c > 0,\\[4pt] P \log_{P/f} P & \text{when } f \le c\,\frac{P}{\log P}, \text{ any } c > 0. \end{cases} \qquad (2)$$
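To get a feel for how this quantity behaves, the following Python sketch evaluates it numerically. It reads the definition as $P \log P/\log\log P$ when $f > cP/\log P$ and $P \log_{P/f} P$ otherwise, and sets every constant hidden by the asymptotic notation to 1. The names `lg` and `q_bound`, the default c = 1 and the restriction 1 ≤ f < P are assumptions of the sketch, not part of the analysis; `lg` follows the paper's convention that log A stands for max{1, log_2 A}.

```python
import math


def lg(x):
    """Base-2 logarithm clamped below by 1, i.e. max{1, log2 x}."""
    return max(1.0, math.log2(x))


def q_bound(P, f, c=1.0):
    """Evaluate Q_{P,f} with all hidden constants set to 1 (an assumption).

    Uses P log P / log log P when f > c P / log P, and
    P log_{P/f} P = P log P / log(P/f) otherwise.
    """
    if not 1 <= f < P:
        raise ValueError("this sketch assumes 1 <= f < P")
    if f > c * P / lg(P):
        return P * lg(P) / lg(lg(P))
    return P * lg(P) / lg(P / f)
```

For P = 1024 the value grows from 1024 at f = 1 to 2560 at f = 64, and the two branches nearly agree around the threshold f ≈ 102, which is what one would expect from matching bounds.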



We use our bounds (1) to derive new bounds for algorithms where the extant
analyses do not integrate f adequately. This is done by analyzing how the
work-load balancing is implemented by the algorithms, e.g., by using coordinators or
global data-structures. We show the following.

II. In Section 5.1 we provide a new analysis of algorithm AN of Chlebus et al.
for Do-All in the message-passing model with crashes. This algorithm has
the best known work for a moderate number of failures. We show the complete
analysis of work W and message complexity M:

$W = O(\log f\,(N + Q_{P,f}))$ and $M = O(N + Q_{P,f} + fP)$.

III. In Section 6.1 we give a complete analysis of the work complexity W of
the algorithm of Kanellakis and Shvartsman that solves the Do-All
(Write-All) problem in synchronous shared-memory systems with processor
crashes:

$W = O(N + Q_{P,f} \log N)$.

Note that the two algorithms are designed for different models and use
dissimilar data and control structures; however, both algorithms make their
load-balancing decisions by gathering global knowledge. By understanding what work
is expended on load balancing vs. the inherent work overhead due to the lower
bounds (1), we are able to obtain the new results while demonstrating the utility
and the generality of our approach.

Do-All algorithms have been used in developing simulations of failure-free
algorithms on failure-prone processors. This is done by iteratively
using a Do-All algorithm to simulate the steps of the failure-free processors. In
this paper we abstract this idea as the iterative Do-All problem as follows:

¹ We let “{O|Ω}” stand for “O” when describing an upper bound and for “Ω” when
describing a lower bound. All logarithms are to the base 2 unless explicitly specified
otherwise. The expression log A stands for max{1, log_2 A} for the given A in the
description of complexity results.
The r-iterative Do-All(N, P, f), or r-Do-All(N, P, f), is the problem of
using P processors to solve r instances of N-task Do-All by doing one
set of tasks at a time.

The oracle r-Do-All$^{O}(N, P, f)$ is defined similarly. An obvious solution for this
problem is to run a Do-All algorithm r times. If the work complexity of Do-All
in a given model is $W_{N,P,f}$, then the work of r-Do-All is clearly no more than
$r \cdot W_{N,P,f}$. We present a substantially better analysis:



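IV. In Section 4.3 we show matching upper and lower bounds on work W for
r-Do-All$^{O}(N, P, f)$, where f < P < N, for specific ranges of failures.

(a) $W = \{O|\Omega\}\left(r \cdot \left(N + P \frac{\log P}{\log\log P}\right)\right)$ when $f > \frac{Pr}{\log P}$,

(b) $W = \{O|\Omega\}\left(r \cdot \left(N + P \log_{Pr/f} P\right)\right)$ when $f \le \frac{Pr}{\log P}$.   (3)

We extract the quantity $\mathcal{R}_{r,P,f}$, defined below, from the bounds (3)(a,b) above,
as it plays an important role in the analysis of iterative Do-All algorithms.

$$\mathcal{R}_{r,P,f} = \begin{cases} P \frac{\log P}{\log\log P} & \text{when } f > \frac{Pr}{\log P}, \quad (a)\\[4pt] P \log_{Pr/f} P & \text{when } f \le \frac{Pr}{\log P}. \quad (b) \end{cases} \qquad (4)$$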



Note that for any r we have $Q_{P,f} \ge \mathcal{R}_{r,P,f}$, and for the specific range of f
in (3)(b) we have that $Q_{P,f} = \omega(\mathcal{R}_{r,P,f})$ with respect to r (fixed P and f). Thus
our bounds (3) are asymptotically better than those obtained by computing the
product of r and the (non-iterated) Do-All bounds (1).
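The gap between running Do-All r times and the iterative bound can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it reads $\mathcal{R}_{r,P,f}$ as $P \log P/\log\log P$ when $f > Pr/\log P$ and $P \log_{Pr/f} P$ otherwise, sets all hidden constants (and c) to 1, and assumes f ≥ 1. The function names are hypothetical.

```python
import math


def lg(x):
    """Base-2 logarithm clamped below by 1, i.e. max{1, log2 x}."""
    return max(1.0, math.log2(x))


def q_bound(P, f):
    """Q_{P,f} with hidden constants (and c) set to 1, an assumption."""
    if f > P / lg(P):
        return P * lg(P) / lg(lg(P))
    return P * lg(P) / lg(P / f)


def r_bound(r, P, f):
    """R_{r,P,f} with hidden constants set to 1, an assumption."""
    if f > r * P / lg(P):
        return P * lg(P) / lg(lg(P))
    return P * lg(P) / lg(r * P / f)


def naive_iterated_work(r, N, P, f):
    """r independent runs of Do-All, each charged N + Q_{P,f}."""
    return r * (N + q_bound(P, f))


def iterative_work(r, N, P, f):
    """The r-Do-All expression r * (N + R_{r,P,f})."""
    return r * (N + r_bound(r, P, f))
```

With r = 16, N = 4096, P = 1024 and f = 64, the overhead term drops from 2560 per round to 1280 per round, because the f crashes are spread over the r rounds instead of being charged to each round in full.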

V. In Section 5.2 we show how to solve r-Do-All(N, P, f) on synchronous
message-passing processors with the following work (W) and message (M)
complexity:

$W = O\left(r \cdot \log\frac{f}{r} \cdot (N + \mathcal{R}_{r,P,f})\right)$ and $M = O\left(r \cdot (N + \mathcal{R}_{r,P,f}) + fP\right)$.

VI. In Section 6.2 we use r-Do-All(N, P, f) to show that P processors with
crashes can simulate any synchronous N-processor, r-time shared-memory
algorithm (PRAM) with work:

$W = O(r \cdot (N + \mathcal{R}_{r,P,f} \log N))$.



This last result is strictly better than the previous deterministic bounds for
parallel algorithm simulations that combine the best Do-All algorithm known
to date with existing simulation techniques (due to the relationship
between $Q_{P,f}$ and $\mathcal{R}_{r,P,f}$ pointed out above).



3 Models and Definitions 

We define the models, the abstract problem of performing N tasks in a dis- 
tributed environment consisting of P processors that are subject to stop-failures, 
and the work complexity measure. 
Distributed setting. We consider a distributed system consisting of P
synchronous processors. We assume that P is fixed and is known. Each processor
has a unique identifier (pid) and the set of pids is totally ordered. 

Tasks. We define a task to be a computation that can be performed by any 
processor in at most one time step; its execution does not depend on any
other task. The tasks are also idempotent, i.e., executing a task many times 
and/or concurrently has the same effect as executing the task once. Tasks are 
uniquely identified by their task identifiers (tids) and the set of tids is totally 
ordered. We denote by T the set of N tasks and we assume that T is known to 
all the processors. 

Model of failures. We assume the fail-stop processor model. A processor
may crash at any moment during the computation and once crashed it does not
restart. We let an omniscient adversary impose failures on the system, and we
use the term failure pattern to denote the set of the events, i.e., crashes, caused
by the adversary. A failure model is then the set of all failure patterns for a given
adversary. For a failure pattern F, we define the size f of the failure pattern as
f = |F| (the number of failures).

The Oracle model. In Section 4 we consider computation where processors
are assisted by a deterministic omniscient oracle. Any processor may contact 
the oracle once per step. The introduction of the oracle serves two purposes. 

(1) The oracle strengthens the model by providing the processors with any in- 
formation about the progress of the computation (the oracle cannot predict the 
future). Thus the lower bounds established for the oracle model also apply to 
any weaker model, e.g., without an oracle. 

(2) The oracle abstracts away any concerns about communication that normally 
dominate specific message-passing and shared-memory models. This allows for
the most general results to be established and it enables us to use these results 
in the context of specific models by understanding how the information provided 
by an oracle is simulated in specific algorithms. 

Communication. In Sections 5 and 6 we deal with message-passing and
shared-memory models. For computation in the message-passing model, we assume that
there is a known upper bound on message delays. (Communication complexity
is defined in Section 5.) When considering computation in the shared-memory
model, we assume that reading or writing to a memory cell takes one time unit,
and that reads and writes can be concurrent.

Do-All problems. We define the Do-All problem as follows:

Do-All: Given a set T of N tasks and P processors, perform all tasks
for any failure pattern in the failure model $\mathcal{F}$.

We let Do-All(N, P, f) stand for the Do-All problem for N tasks, P processors
(P < N), and any pattern of crashes F such that |F| ≤ f < P. We let
Do-All$^{O}(N, P, f)$ stand for the Do-All(N, P, f) problem with the oracle. We define
the iterative Do-All problem as follows:
Iterative Do-All: Given any r sets $T_1, \ldots, T_r$ of N tasks each and P
processors, perform all r · N tasks, doing one set at a time, for any
failure pattern in the failure model $\mathcal{F}$.

We denote such r-iterative Do-All by r-Do-All(N, P, f). The oracle version
r-Do-All$^{O}(N, P, f)$ is defined similarly.

Measuring efficiency. We are interested in studying the complexity of Do-All
measured as work. We assume that it takes a unit of time for a
processor to perform a unit of work, and that a single task corresponds to a unit
of work. Our definition of work complexity is based on the available processor
steps measure. Let $\mathcal{F}$ be the adversary model. For a computation subject to
a failure pattern F, $F \in \mathcal{F}$, denote by $P_i(F)$ the number of processors completing
a unit of work in step i of the computation.

Definition 1. Given a problem of size N and a P-processor algorithm that
solves the problem in the failure model $\mathcal{F}$, if the algorithm solves the problem
for a pattern F in $\mathcal{F}$, with |F| ≤ f, by time step τ, then the work complexity W
of the algorithm is: $W_{N,P,f} = \max_{F \in \mathcal{F},\, |F| \le f} \left\{ \sum_{i \le \tau} P_i(F) \right\}$.

Note that the idling processors consume a unit of work per step even though
they do not contribute to the computation. Definition 1 does not depend on the
specifics of the target model of computation, e.g., whether it is message-passing
or shared-memory. (Communication complexity is defined similarly in Section 5.)
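The available-processor-steps measure can be made concrete with a small Python sketch. The representation of a failure pattern as a dictionary from step number to number of crashes, and the function names, are assumptions made for illustration; the maximum is taken over a finite list of patterns rather than over the whole failure model.

```python
def work_of_execution(P, crashes_at_step, steps):
    """Available processor steps: sum over steps of processors completing the step.

    crashes_at_step maps a step number (1-based) to the number of processors
    that crash during that step and therefore do not complete it.
    """
    alive = P
    work = 0
    for i in range(1, steps + 1):
        alive -= crashes_at_step.get(i, 0)
        if alive <= 0:
            raise ValueError("at least one processor must survive")
        work += alive  # idle but alive processors are charged as well
    return work


def work_complexity(P, executions):
    """Maximum work over a finite list of (failure_pattern, steps) pairs."""
    return max(work_of_execution(P, F, tau) for F, tau in executions)
```

For example, four processors running three steps with one crash in step 2 are charged 4 + 3 + 3 = 10 units of work, whether or not every surviving processor had a useful task to do.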

4 The Bounds with Perfect Load Balancing 

In this section we give the complete analysis of the upper and lower bounds for
the Do-All$^{O}(N, P, f)$ and r-Do-All$^{O}(N, P, f)$ problems for the entire range of f
crashes (f < P < N). (Note: we use the quantities $Q_{P,f}$ and $\mathcal{R}_{r,P,f}$ that are
defined in Section 2 in equations (2) and (4) respectively.)

4.1 Do-All Upper Bounds 

To study the upper bounds for Do-All we give an oracle-based algorithm in
Figure 1. The oracle tells each processor whether or not all tasks have been
performed, Oracle-says(), and what task to perform next, Oracle-task() (the
correctness of the algorithm is trivial). Thus the oracle performs the termination
and load-balancing computation on behalf of the processors.

for each processor pid = 1..P begin
   global T[1..N];
   while Oracle-says(pid) = “not done”
      do perform task T[Oracle-task(pid)] od
end.

Fig. 1. Oracle-based algorithm.
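The loop of Figure 1 can be simulated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the oracle is modelled as a round-robin assignment of the live processors to the undone tasks, and the `adversary(step, assignment)` callback, which returns the set of pids to crash during a step, is a hypothetical interface introduced here.

```python
def run_oracle_do_all(N, P, adversary=None):
    """Simulate the oracle-based loop; return (work, steps).

    adversary(step, assignment) returns the set of pids to crash during the
    step (hypothetical interface); crashed processors complete nothing.
    """
    undone = set(range(N))
    alive = list(range(P))
    work = 0
    step = 0
    while undone:  # Oracle-says(pid) = "not done"
        step += 1
        tasks = sorted(undone)
        # Oracle-task(pid): spread the live processors evenly over undone tasks.
        assignment = {pid: tasks[k % len(tasks)] for k, pid in enumerate(alive)}
        crashed = adversary(step, assignment) if adversary else set()
        alive = [pid for pid in alive if pid not in crashed]
        if not alive:
            raise ValueError("the model requires f < P: someone must survive")
        work += len(alive)
        undone -= {assignment[pid] for pid in alive}
    return work, step
```

Without failures, N = 8 tasks on P = 4 processors finish in two steps with work 8. With N = P = 4 and one crash in the first step, the task of the crashed processor must be redone, giving work 3 + 3 = 6.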
Lemma 1. The Do-All$^{O}(N, P, f)$ problem with f < P < N can be solved
using work $W = O\left(N + P \frac{\log P}{\log\log P}\right)$.

Note that Lemma 1 does not show how work depends on the number of crashes f.

Lemma 2. For any c > 0, Do-All$^{O}(N, P, f)$ can be solved for any stop-failure
pattern with $f \le c\,\frac{P}{\log P}$ using work $W = O(N + P + P \log_{P/f} N)$.

Proof. The proof is based on the proof of Theorem 3.6 of H2|. 

We now give our main upper-bound result. 

Theorem 1. Do-All$^{O}(N, P, f)$ can be solved for any failure pattern using work
$W = O(N + Q_{P,f})$.

Proof. This follows directly from Lemmas 1 and 2.

4.2 Do-All Lower Bounds 

We now show matching lower bounds for Do-All$^{O}(N, P, f)$. Note that the results
in this section hold also for the Do-All(N, P, f) problem (without the oracle).

Lemma 3. For any algorithm that solves Do-All$^{O}(N, P, f)$ there exists a
pattern of f stop-failures (f < P) that results in work $W = \Omega\left(N + P \frac{\log P}{\log\log P}\right)$.

We now define a specific adversarial strategy used to derive our lower
bounds. Let Alg be any algorithm that solves the Do-All problem. Let $P_i$ be
the number of processors remaining at the end of the i-th step of Alg and let
$U_i$ denote the number of tasks that remain to be done at the end of step i.

Initially, Pq = P and N = Uq. Define k = '^p°og , 0 < k < 1. 

Adversary Adv: At step i (i ≥ 1) of Alg, the adversary stops processors as
follows: Among the $U_{i-1}$ tasks remaining after step i − 1, the adversary chooses
$U_i = \lfloor \kappa U_{i-1} \rfloor$ tasks with the least number of processors assigned to them and
crashes these processors. The adversary continues for as long as $U_i > 1$. As soon
as $U_i = 1$ the adversary allows all remaining processors to perform the single
remaining task, and Alg terminates.
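The strategy of Adv can be exercised against a simple scheduler in Python. The sketch below makes several assumptions that are not in the text: the fraction κ is passed in as a free parameter `kappa` rather than computed from P and f, the algorithm Alg is replaced by a balanced round-robin assignment, and ties among equally loaded tasks are broken by task index.

```python
import math
from collections import defaultdict


def run_against_adv(N, P, kappa):
    """Return (work, crashes, steps) for a balanced scheduler facing Adv."""
    undone = list(range(N))
    alive = list(range(P))
    work = crashes = steps = 0
    while undone:
        steps += 1
        # Balanced assignment of live processors to undone tasks (assumption).
        assignment = {pid: undone[k % len(undone)] for k, pid in enumerate(alive)}
        by_task = defaultdict(list)
        for pid, task in assignment.items():
            by_task[task].append(pid)
        # Adv keeps floor(kappa * U) tasks undone; with one task left it stops.
        keep = math.floor(kappa * len(undone)) if len(undone) > 1 else 0
        targets = sorted(undone, key=lambda t: len(by_task[t]))[:keep]
        doomed = {pid for t in targets for pid in by_task[t]}
        alive = [pid for pid in alive if pid not in doomed]
        crashes += len(doomed)
        work += len(alive)
        done = {assignment[pid] for pid in alive}
        undone = [t for t in undone if t not in done]
    return work, crashes, steps
```

With N = P = 16 and `kappa = 0.5` the adversary halves both the task set and the processor set in each step; with `kappa = 0.25` it crashes fewer processors, so more of them keep being charged in every step and the total work is larger.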

Lemma 4. Given any c > 0 and any algorithm Alg that solves Do-All$^{O}(N, P, f)$
for P > N, the adversary Adv causes f stop-failures, $f \le c\,\frac{P}{\log P}$, and
$W = \Omega(P + P \log_{P/f} N)$.

Proof. (Sketch.) Let t be the number of iterations caused by adversary Adv and 
Pr be the number of processors at the last iteration of Alg. Given the definition 
of Adv (given above), we find that r > ^ P~ f- The result 

follows by the observation that the work caused by Adv is at least t ■ Pr. 
Lemma 5. Given any c > 0 and any algorithm Alg that solves Do-All$^{O}(N, P, f)$
for P < N, there exists an adversary that causes f stop-failures, $f \le c\,\frac{P}{\log P}$, and
$W = \Omega(N + P \log_{P/f} P)$.

We now give our main lower-bound result. 

Theorem 2. Given any algorithm Alg that solves Do-All$^{O}(N, P, f)$ for P < N,
there exists an adversary that causes $W = \Omega(N + Q_{P,f})$.

Proof. For the range of failures $0 < f \le c\,\frac{P}{\log P}$, Lemma 5 establishes the bound.
When $f = c\,\frac{P}{\log P}$, the work (per Lemma 5) is $\Omega\left(N + P \frac{\log P}{\log\log P}\right)$. For larger f
the adversary establishes this worst case work using the initial failures.

4.3 Iterative Do-All 

Do-All algorithms have been used in developing simulations of failure-free al- 
gorithms on failure-prone processors. This is done by iteratively using a Do-All 
algorithm to simulate the steps of the failure-free processors. We study the it- 
erative Do-All problems to understand the complexity implications of iterative 
use of Do-All algorithms. 

In principle r-Do-All(N, P, f) can be solved by running an algorithm for
Do-All(N, P, f) r times. If the work of a Do-All solution is W, then the work of the
r-iterative Do-All is at most r · W. However, we show that it is possible to obtain a
finer result. We refer to each Do-All iteration as a round of r-Do-All$^{O}(N, P, f)$.

Theorem 3. r-Do-All$^{O}(N, P, f)$ can be solved with $W = O(r \cdot (N + \mathcal{R}_{r,P,f}))$.

Proof. (Sketch) In the case where $f > \frac{Pr}{\log P}$ we can have $O\left(\frac{P}{\log P}\right)$ processor
failures in all r rounds. From this and Theorem 1 the work is $O\left(r \cdot \left(N + P \frac{\log P}{\log\log P}\right)\right)$.

Consider the case where $f \le \frac{Pr}{\log P}$. Let $f_i$ be the number of failures in round
$r_i$. It is easily shown that the work in every round is $O(N + P \log_{P/f_i} P)$. We treat
$f_i$ as a continuous parameter and we compute the first derivative and second
derivative w.r.t. $f_i$. The first derivative is positive, the second derivative is
negative. Hence the first derivative is decreasing (with $f_i$). In this case, given any
two $f_i$, $f_j$ where $f_i > f_j$, the failure pattern obtained by replacing $f_i$ with $f_i - \epsilon$
and $f_j$ by $f_j + \epsilon$ (where $\epsilon < (f_i - f_j)/2$) results in increased work. This implies
that the work is maximized when all $f_i$'s are equal, specifically when $f_i = f/r$.
This results in total work of $O(r \cdot (N + P \log_{Pr/f} P))$.
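The derivative argument in the proof sketch can be written out explicitly. The calculation below is a supplementary sketch under the assumption that the per-round overhead has the form $g(x) = P \ln P/\ln(P/x)$, with x playing the role of $f_i$ (natural logarithms are used; changing the base only rescales numerator and denominator by the same constant). The symbols g, u and x are introduced here for illustration only.

```latex
\begin{align*}
g(x) &= \frac{P \ln P}{u}, \qquad u = \ln\frac{P}{x} = \ln P - \ln x, \qquad \frac{du}{dx} = -\frac{1}{x},\\
g'(x) &= \frac{P \ln P}{x\,u^{2}} \;>\; 0,\\
g''(x) &= \frac{P \ln P}{x^{2}u^{3}}\,(2 - u) \;<\; 0 \quad\text{whenever } u > 2,\ \text{i.e. } x < P/e^{2}.
\end{align*}
```

So g is increasing, and it is concave as long as every $f_i$ stays below $P/e^{2}$. In that range, moving failures from a round with more crashes to a round with fewer crashes can only increase $\sum_i g(f_i)$, which is why the even split $f_i = f/r$ is the worst case.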

Theorem 4. Given any algorithm that solves r-Do-All$^{O}(N, P, f)$ there exists a
stop-failure adversary that causes $W = \Omega(r \cdot (N + \mathcal{R}_{r,P,f}))$.

Proof. In the case where $f > \frac{Pr}{\log P}$ the adversary may fail-stop $\frac{P}{\log P}$ processors in
every round of r-Do-All$^{O}(N, P, f)$. Note that for this adversary Ω(P) processors
remain alive during the first ⌈r/2⌉ rounds. Per Theorem 2 this results in
$\lceil r/2 \rceil \cdot \Omega\left(N + P \frac{\log P}{\log\log P}\right)$ work. In the case where $f \le \frac{Pr}{\log P}$
the adversary ideally would kill f/r processors in every round. It can do that in
the case where r divides f. If this is not the case, then the adversary kills ⌈f/r⌉
processors in $r_a$ rounds and ⌊f/r⌋ in $r_b$ rounds in such a way that $r = r_a + r_b$.
Again considering the first half of the rounds and appealing to Theorem 2 results
in an $\Omega(r(N + P \log_{Pr/f} P))$ lower bound for work. Note that we consider only the
case where r < f; otherwise the work is trivially Ω(rN).

5 New Bounds for the Message-Passing Model 

In this section we demonstrate the utility of the complexity results under the
perfect load-balancing assumption by giving a tight and complete analysis of
algorithm AN and by establishing new complexity results for the iterative Do-All
in the message-passing model.

The efficiency of message-passing algorithms is characterized in terms of their
work and message complexity. We define message complexity similarly to
Definition 1 of work: For a computation subject to a failure pattern F, $F \in \mathcal{F}$,
denote by $M_i(F)$ the number of point-to-point messages sent during step i of
the computation. For a given problem of size N, if the computation solves the
problem by step τ in the presence of the failure pattern F, where |F| ≤ f, then
the message complexity M is $M_{N,P,f} = \max_{F \in \mathcal{F},\, |F| \le f} \left\{ \sum_{i \le \tau} M_i(F) \right\}$.

5.1 Analysis of Algorithm AN 

Algorithm AN presented by Chlebus et al. uses a multiple-coordinator
approach to solve Do-All(N, P, f) on crash-prone synchronous message-passing
processors (P < N). The model assumes that messages incur a known bounded
delay and that reliable multicast is available; however, messages to/from faulty
processors may be lost.

Description of algorithm AN. Due to the space limitation, we give a very
brief description of the algorithm; for additional details we refer the reader to
the paper by Chlebus et al.
Algorithm AN proceeds in a loop which is iterated until all the tasks are exe- 
cuted. A single iteration of the loop is called a phase. A phase consists of three 
consecutive stages. Each stage consists of three steps. In each stage processors 
use the first step to receive messages sent in the previous stage, the second step 
to perform local computation, and the third step to send messages. A proces- 
sor can be a coordinator or a worker. A phase may have multiple coordinators. 
The number of processors that assume the coordinator role is determined by the 
martingale principle: if none of the expected coordinators survive through the 
entire phase, then the number of coordinators for the next phase is doubled. If 
at least one coordinator survives in a given phase, then in the next phase there 
is only one coordinator. A phase that is completed with at least one coordinator 
alive is called attended, otherwise it is called unattended. 
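The doubling rule for coordinators described above can be sketched as a small Python function. The function name and the choice to start with a single coordinator in the first phase are assumptions of this sketch.

```python
def coordinator_counts(attended):
    """Number of coordinators at the start of each phase.

    attended[k] is True when phase k ends with at least one coordinator alive.
    Starting from a single coordinator is an assumption of this sketch.
    """
    counts = []
    current = 1
    for phase_attended in attended:
        counts.append(current)
        # Unattended phase: double; attended phase: fall back to one coordinator.
        current = 1 if phase_attended else 2 * current
    return counts
```

For instance, two unattended phases followed by an attended one give the sequence 1, 2, 4 coordinators, after which the count drops back to 1.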

Processors become coordinators and balance their loads according to each 
processor’s local view. A local view contains the set of ids of the processors 
assumed to be alive. The local view is partitioned into layers. The first layer
contains one processor id, the second two ids, and the i-th contains $2^{i-1}$ ids.
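The layered structure of a local view can be illustrated as follows. This is a sketch only: the function name is hypothetical, the view is assumed to be an ordered list of pids, and the handling of an incomplete last layer is an assumption, since the brief description here does not cover it.

```python
def layers(local_view):
    """Split an ordered list of pids into layers of sizes 1, 2, 4, ...

    The last layer may be incomplete when the view size is not 2^k - 1
    (how the algorithm treats that case is not specified here).
    """
    result = []
    start, size = 0, 1
    while start < len(local_view):
        result.append(local_view[start:start + size])
        start += size
        size *= 2
    return result
```

A view with eight pids therefore yields layers of sizes 1, 2, 4 and a final partial layer with the one remaining pid.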

Given a phase, in the first stage, the processors perform a task according to 
the load balancing rule derived from their local views and report the completion 
of the task to the coordinators of that phase (determined by their local views). In 
the second stage, the coordinators gather the reports, they update the knowledge 
of the done tasks and they multicast this information to the processors that are 
assumed to be alive. In the last stage, the processors receive the information sent 
by the coordinators and update their knowledge of done tasks and their local 
views. Given the full details of the algorithm, it is not difficult to see that the
combination of coordinators and local views allows the processors to obtain the
information that would be available from the oracle in the algorithm in Figure 1.
It is shown by Chlebus et al. that the work of algorithm AN is
$W = O((N + N \log N/\log\log N) \log f)$ and its message complexity is
$M = O(N + P \log P/\log\log P + fP)$.

New analysis of work complexity. To assess the work W, we consider
separately all the attended phases and all the unattended phases of the execution.
Let $W_a$ be the part of W spent during all the attended phases and $W_u$ be the
part of W spent during all the unattended phases. Hence we have $W = W_a + W_u$.

Lemma 6. In any execution of algorithm AN with f < P we have
$W_a = O\left(N + P \frac{\log P}{\log\log P}\right)$ and $W_u = O(W_a \log f)$.

We now give the new analysis of algorithm AN. 

Lemma 7. In any execution of algorithm AN we have $W_a = O(N + P \log_{P/f} P)$
when $f \le c\,\frac{P}{\log P}$, for any c > 0.

Theorem 5. In any execution of algorithm AN we have $W = O(\log f\,(N + Q_{P,f}))$.

Proof. This follows from Lemmas 6 and 7 and the fact that $W = W_a + W_u$.

Analysis of message complexity. To assess the message complexity M we
consider separately all the attended phases and all the unattended phases of
the execution. Let $M_a$ be the number of messages sent during all the attended
phases and $M_u$ the number of messages sent during all the unattended phases.
Hence we have $M = M_a + M_u$.

Lemma 8. In any execution of algorithm AN we have $M_a = O(W_a)$ and
$M_u = O(fP)$.

Theorem 6. In any execution of algorithm AN we have $M = O(N + Q_{P,f} + fP)$.

Proof. It follows from Lemmas 6, 7 and 8 and the fact that $M = M_a + M_u$.

5.2 Analysis of Message-Passing Iterative Do-All 

We now consider the message-passing r-Do-All(N, P, f) problem (P < N).

Theorem 7. The r-Do-All(N, P, f) problem can be solved on synchronous
crash-prone message-passing processors with work $W = O\left(r \cdot \log\frac{f}{r} \cdot (N + \mathcal{R}_{r,P,f})\right)$
and with message complexity $M = O(r \cdot (N + \mathcal{R}_{r,P,f}) + fP)$.
6 New Bounds for the Shared-Memory Model

Here we give a new refined analysis of the most work-efficient known Do-All
algorithm for the shared-memory model, algorithm W. We also establish the
complexity results for the iterative Do-All and for simulations of synchronous
parallel algorithms on crash-prone processors.

6.1 Analysis of Algorithm W 

Algorithm W solves Do-All(N, P, f) in the shared-memory model (where Do-All
is better known as Write-All). Its work for any pattern of crashes is
$O(N + P \log N \log P/\log\log P)$ for P < N. Note that this bound is conservative,
since it does not include f, the number of crashes.

Description of the algorithm. We now give a brief description of the
algorithm; for additional details we refer the reader to the original paper.
Algorithm W is structured as a parallel loop through four phases: (W1) a failure
detecting phase, (W2) a load rescheduling phase, (W3) a work phase, and (W4) a
phase that estimates the progress of the computation and the remaining work and
that controls the parallel loop. These phases use full binary trees with O(N) leaves.
The processors traverse the binary trees top-down or bottom-up according to the phase.
Each such traversal takes O(log N) time (the height of a tree). For a single
processor, each iteration of the loop is called a block-step; since there are four phases
with at most one tree traversal per phase, each block-step takes O(log N) time.

In algorithm W the trees stored in shared memory serve as the gathering 
places for global information about the number of active processors, remaining 
tasks and load balancing. It is not difficult to see that these binary trees indeed 
provide the information to the processors that would be available from an ora- 
cle in the oracle model. The binary tree used in phase W2 to implement load 
balancing and phase W3 to assess the remaining work is called the progress tree. 

Here we use the parameterized version of the algorithm with P < N and
where the progress tree has U = max{P, N/log N} leaves. The tasks are
associated with the leaves of this tree, with N/U tasks per leaf. Note that each
block-step still takes time O(log N).
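The parameter choice can be checked on concrete numbers with a short Python sketch. Rounding N/log N and N/U up to integers is an assumption made here so that the values can serve as counts; the function name is hypothetical.

```python
import math


def progress_tree_parameters(N, P):
    """Return (U, tasks_per_leaf) for the parameterized algorithm W.

    U = max{P, N / log N}; rounding up to integers is an assumption.
    """
    log_n = max(1.0, math.log2(N))
    U = max(P, math.ceil(N / log_n))
    tasks_per_leaf = math.ceil(N / U)
    return U, tasks_per_leaf
```

For N = 1024 and P = 64 this gives U = 103 leaves with at most 10 tasks per leaf. Since U is at least N/log N, a leaf never holds more than about log N tasks, which is consistent with a block-step still costing O(log N) time.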

New complexity analysis. We now give the work analysis. We charge each 
processor for each block step it starts, regardless of whether or not the processor 
completes it or crashes. 

Lemma 9. For any failure pattern with f < P, the number of block-steps
required by the P-processor algorithm W with U leaves in the progress tree is
$B = O\left(U + P \frac{\log P}{\log\log P}\right)$.

Lemma 10. For any failure pattern with $f \le c\,\frac{P}{\log P}$ (for any c > 0), the
number of block-steps required by the P-processor algorithm W with U leaves in the
progress tree is $B = O(U + P \log_{P/f} P)$.

Theorem 8. Algorithm W solves Do-All(N, P, f) with work $O(N + Q_{P,f} \log N)$.
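One way to see how the bound of Theorem 8 can follow from Lemmas 9 and 10 is sketched below. It assumes, in line with the charging scheme above, that the work is at most the number B of block-steps times the O(log N) cost of a block-step, that U = max{P, N/log N}, and that the two lemmas together give $B = O(U + Q_{P,f})$.

```latex
\begin{align*}
W &= O(B \log N) = O\bigl((U + Q_{P,f}) \log N\bigr)\\
  &= O\bigl(\max\{P \log N,\; N\} + Q_{P,f} \log N\bigr)\\
  &= O\bigl(N + Q_{P,f} \log N\bigr), \qquad\text{since } P \le Q_{P,f}.
\end{align*}
```

The last step uses the fact that both cases of $Q_{P,f}$ are at least P, so the term P log N is absorbed by $Q_{P,f} \log N$.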
6.2 Iterative Do-All and Parallel Algorithm Simulations

We now consider the complexity of shared-memory r-Do-All(N, P, f) and of
PRAM simulations.

Theorem 9. The r-Do-All(N, P, f) problem can be solved on P crash-prone
processors (P < N), using shared memory, with work $W = O(r \cdot (N + \mathcal{R}_{r,P,f} \log N))$.

Theorem 10. Any synchronous N-processor, r-time shared-memory parallel
algorithm (PRAM) can be simulated on P crash-prone synchronous processors with
work $O(r \cdot (N + \mathcal{R}_{r,P,f} \log N))$.



7 Conclusions 

In this paper we give the first complete analysis of the Do-All problem under 
the perfect load-balancing assumption. We introduce and analyze the iterative 
Do-All problem that models repeated use of Do-All algorithms, such as found in 
algorithm simulations and transformations. A unique contribution of our analy- 
ses is that they precisely describe the effect of crash failures on the work of the 
computation. We demonstrate the utility of the analyses obtained with the per- 
fect load-balancing assumption by using them to analyze message-passing and 
shared-memory algorithms and simulations that attempt to balance the loads 
among the processors. 
Abstract. We address the problem of mobile agents searching a ring
network for a highly harmful item, a black hole, a stationary process 
destroying visiting agents upon their arrival. No observable trace of 
such a destruction will be evident. The location of the black hole is not 
known; the task is to unambiguously determine and report the location 
of the black hole. We answer some natural computational questions:
How many agents are needed to locate the black hole in the ring? How
many suffice? What a-priori knowledge is required? as well as complexity
questions, such as: With how many moves can the agents do it? How
long does it take?

Keywords: Mobile Agents, Distributed Computing, Ring Network, Haz- 
ardous Search. 



1 Introduction 

The most widespread use of autonomous mobile agents in network environments, 
from the World-Wide-Web to the Data Grid, is clearly to search, i.e., to locate
some required “item” (e.g., information, resource, . . . ) in the environment. This 
process is started with the specification of what must be found and ends with 
the reporting of where it is located. 

The proposed solutions integrate their algorithmic strategies with an
exploitation of the capabilities of the network environment; so, not surprisingly,
they are varied in nature, style, applicability and performance.
They do however share the same assumption about the “item” to be
located by the agents: it poses no danger, it is harmless.

This assumption unfortunately does not always hold: the item could be a 
local program which severely damages visiting agents. In fact, protecting an 
agent from “host attacks” (i.e., harmful items stored at the visited site) has 
become a problem almost as pressing as protecting a host (i.e., a site) from 
an agent attack. Still, this problem has not been taken into
account so far by any of the existing solutions.

In this paper we address the problem of searching for a highly harmful item 
whose existence we are aware of, but whose whereabouts are unknown. The item 
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is a stationary process which disposes of visiting agents upon their arrival; no 
observable trace of such a destruction will be evident. Because of its nature, we 
shall call such an item a black hole. 

The task is to unambiguously determine and report the location of the black
hole (following this phase, a “rescue” activity would conceivably be initiated
to deal with such a destructive process). In the distributed computing
literature, there have been many studies on computing in the presence of
undetectable faulty components (e.g., [?], which can be rephrased in terms of
computations in the presence of black holes). However, as mentioned, this
problem has never been investigated before. We are interested in understanding
the basic algorithmic limitations and factors. The setting we consider is the
simplest symmetric topology: the anonymous ring (i.e., a loop network of
identical nodes). Mobile agents operate in this setting: the agents have limited
computing capabilities and bounded storage¹, obey the same set of behavioral
rules (the “protocol”), and can move from node to neighboring node. We make no
assumptions on the amount of time required by an agent’s actions (e.g.,
computation, movement, etc.) except that it is finite; thus, the agents are
asynchronous. Each node has a bounded amount of storage, called whiteboard;
O(log n) bits suffice for all our algorithms. Agents communicate by reading from
and writing on the whiteboards; access to a whiteboard is done in mutual
exclusion.

Some basic computational questions naturally and immediately arise, such
as: How many agents are needed to locate the black hole? How many suffice?
What a priori knowledge is required? There are also complexity questions, such
as: With how many moves can the agents do it? How long does it take?

In this paper, we provide some definite answers to each of these questions.
Some answers follow from simple facts. For example, if the existence of the black
hole is a possibility but not a certainty, it is impossible² to resolve this
ambiguity. Similarly, if the ring size n is not known, then the black-hole search
problem cannot be solved. Hence, n must be known. Another fact is that at least
two agents are needed to solve the problem.

A more interesting fact is that if the agents are co-located (i.e., start from the 
same node) and anonymous (i.e., do not have distinct labels), then the problem 
is unsolvable. Therefore, to find the black hole, co-located agents must be distinct 
(i.e., have different labels); conversely, anonymous agents must be dispersed (i.e., 
start from different nodes). In this paper we consider both settings. 

We first consider distinct co-located agents. We prove that two such agents
are both necessary and sufficient to locate the black hole. Sufficiency is proved
constructively: we present a distributed algorithm which allows locating the black
hole using only two agents. This algorithm is also optimal, within a factor of
two, in terms of the number of moves performed by the two agents. In fact, we show
the stronger result that (n − 1) log(n − 1) + O(n) moves are needed regardless
of the number of co-located agents, and that with our algorithm two agents can
solve the problem with no more than 2n log n + O(n) moves.



¹ O(log n) bits suffice for all our algorithms.

² I.e., no deterministic protocol exists which always correctly terminates.

We also focus on the minimal amount of time spent by co-located agents to
locate the black hole. We easily show that 2n − 4 (ideal) time units are needed,
regardless of the number of agents; we then describe how to achieve such a time
bound using only n − 1 agents. We generalize this technique and establish a
general trade-off between time and number of agents.

We then consider anonymous dispersed agents. We prove that two anonymous
dispersed agents are both necessary and sufficient to locate the black hole
if the ring is oriented. Also in this case the proof of sufficiency is
constructive: we present an algorithm which, when executed by two or more
anonymous agents dispersed in an oriented ring, allows finding the black hole
with O(n log n) moves. This algorithm is optimal in terms of number of moves; in
fact, we prove that any solution with k anonymous dispersed agents requires
Ω(n log(n − k)) moves, provided k is known; if k is unknown, Ω(n log n) moves are
always required.

We also show that three anonymous dispersed agents are necessary and sufficient
if the ring is unoriented. Sufficiency follows constructively from the result for
oriented rings. Due to space restrictions, some of the proofs will be omitted.

2 Basic Results and Lower Bound 

2.1 Notation and Assumptions 

The network environment is a set A of asynchronous mobile agents in a ring R
of n anonymous (i.e., unlabeled³) nodes. The size n of R is known to the agents;
the number of agents |A| = k ≥ 2 might not be a priori known. The agents
can move from node to neighboring node in R, have computing capabilities and
bounded storage, obey the same set of behavioral rules (the “protocol”), and
all their actions (e.g., computation, movement, etc.) take a finite but otherwise
unpredictable amount of time. Each node has two ports, labelled left and right;
if this labelling is globally consistent, the ring will be said to be oriented,
unoriented otherwise. Each node has a bounded amount of storage, called
whiteboard; O(log n) bits suffice for all our algorithms. Agents communicate by
reading from and writing on the whiteboards; access to a whiteboard is done in
mutual exclusion. A black hole is a stationary process located at a node, which
destroys any agent arriving at that node; no observable trace of such a
destruction will be evident to the other agents.

The location of the black hole is unknown to the agents. The Black-Hole
Search (BHS) problem is to find the location of the black hole. More precisely,
BHS is solved if at least one agent survives, and all surviving agents know the
location of the black hole (explicit termination). Notice that our lower bounds are
established requiring only that at least one surviving agent knows the location
of the black hole (the difference is only O(n) moves/time).

³ Alternatively, they all have the same label.

First of all notice that, because of the asynchrony of the agents, we have
that:

Fact 1. It is impossible to distinguish between slow links and a black hole.

This simple fact has several important consequences; in particular: 

Corollary 1. 

1. It is impossible to determine (using explicit termination) whether or not 
there is a black hole in the ring. 

2. Let the existence of the black hole in R be common knowledge. It is impossible
to find the black hole if the size of the ring is not known.

Thus, we assume that both the existence of the black hole and the size n 
of the ring are common knowledge to the agents. The agents are said to be 
co-located if they all start from the same node; if they initially are in different 
nodes, they are said to be dispersed. 

Fact 2. Anonymous agents starting at the same node collectively behave as one 
agent. 

Corollary 2. It is impossible to find the black hole if the agents are both co- 
located and anonymous. 

Thus, we assume that if the agents are initially placed in the same node, they
have distinct identities; on the other hand, if they start from different locations
there is at most one agent starting at any given node. Finally, observe the obvious
fact that if there is only one agent the BHS problem is unsolvable; that is:

Fact 3. At least two agents are needed to locate the black hole.

Thus, we assume that there are at least two agents. Let us now introduce 
the complexity measure used in the paper. Our main measures of complexity are 
the number of agents, called size, and the total number of moves performed by 
the agents, which we shall call cost. We will also consider the amount of time 
elapsed until termination. Since the agents are asynchronous, “real” time cannot 
be measured. We will use the traditional measure of ideal time (i.e., assuming 
synchronous execution where a move can be made in one time unit); sometimes 
we will also consider bounded delay (i.e., assuming an execution where a move 
requires at most one time unit), and causal time (i.e., the length of the longest, 
over all possible executions, chain of causally related moves). In the following, 
unless otherwise specified, “time” complexity is “ideal time” complexity. 

2.2 Cautious Walk 

At any time during the search for the black hole, the ports (corresponding to 
the incident links) of a node can be classified as (a) unexplored - if no agent has 
moved across this port, (b) safe - if an agent arrived via this port or (c) active 
- if an agent departed via this port, but no agent has arrived via it. 

It is always possible to avoid sending agents over active links using a
technique we shall call cautious walk: when an agent moves from node u to v via an
unexplored port (turning it into active), it must immediately return to u
(making the port safe), and only then go back to v to resume its execution; an
agent needing to move from u to v via an active port must wait until the port
becomes safe. In the following, by the expression moving cautiously we will mean
moving using cautious walk. Cautious walk reduces the number of agents that may
enter the black hole in a ring to 2 (i.e., the degree of the node containing the
black hole). Note that this technique can be used in any asynchronous algorithm A,
at a cost of O(n) additional moves, with minimal consequences:

Lemma 1 ([8]). Let A' be the algorithm obtained from A by enforcing cautious
walk. For every execution E' of A' there exists a corresponding execution E of
A such that E is obtained from E' by deleting only the additional moves due to
cautious walk.

Let us remark that cautious walk is a general technique that can be used
in any topology; furthermore, it has been shown that every black-hole location
algorithm for two agents must use cautious walk [?].
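The port classification and the cautious-walk rule can be illustrated with a
small simulation. The sketch below is only an illustration under simplifying
assumptions: a single agent on a ring whose nodes are numbered 0, ..., n − 1,
ports identified by (node, direction) pairs, and the hypothetical names
PortState, cautious_step and explore_right, none of which come from the paper.

```python
from enum import Enum


class PortState(Enum):
    UNEXPLORED = "unexplored"
    ACTIVE = "active"
    SAFE = "safe"


def cautious_step(ports, u, direction, n, black_hole):
    """Move one agent from node u through port `direction` (+1 or -1).

    `ports` maps (node, direction) to a PortState; missing entries are
    unexplored. Returns (new_position, moves_used); new_position is None
    if the agent was destroyed by the black hole.
    """
    v = (u + direction) % n
    state = ports.get((u, direction), PortState.UNEXPLORED)
    if state is PortState.ACTIVE:
        raise RuntimeError("port is active: the agent must wait")
    if state is PortState.SAFE:
        return v, 1                              # safe link: just cross it
    ports[(u, direction)] = PortState.ACTIVE     # departing via an unexplored port
    if v == black_hole:
        return None, 1                           # destroyed; the port stays active
    ports[(v, -direction)] = PortState.SAFE      # the agent arrived at v via this port
    ports[(u, direction)] = PortState.SAFE       # it returned to u, making the port safe
    return v, 3                                  # u -> v, v -> u, u -> v


def explore_right(n, black_hole, start=0):
    """Walk cautiously in direction +1 until the agent is destroyed."""
    ports, pos, moves = {}, start, 0
    while pos is not None:
        pos, used = cautious_step(ports, pos, +1, n, black_hole)
        moves += used
    return ports, moves
```

In this sketch an agent that is destroyed leaves exactly one active port behind,
which is what lets other agents avoid following it into the black hole.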

2.3 Lower Bound on Moves 

In this section we consider the minimum number of moves required to solve
the problem. The existence of an asymptotic Ω(n log n) bound can be proven
by carefully adapting and modifying the (rather complex) proof of the result of
[?] on rings with a faulty link. In the following, using a substantially different
argument, we are able to obtain directly a more precise (not only asymptotic)
bound, with a simpler proof. In fact we show that, regardless of the setting (i.e.,
co-location or dispersal) and of the number of agents employed,
(n − 1) log(n − 1) + O(n) moves are required.

In the following, we will denote by E_t and U_t the explored and unexplored
area at time t, respectively. Moreover, z_t denotes the central node of E_t; that
is, given x_t and y_t, the two border nodes in E_t that connect to U_t, z_t is the
node in E_t at distance ⌈|E_t|/2⌉ − 1 from x_t.

Definition 1. A causal chain from a node v_p to a node v_q has been executed at
time t, if ∃d ∈ N, ∃u_1, u_2, ..., u_d ∈ V and times t_1, t'_1, t_2, t'_2, ...,
t_d, t'_d such that

- t < t_1 < t'_1 < t_2 < t'_2 < ... < t_d < t'_d,

- v_p = u_1, v_q = u_d and ∀i ∈ {1, 2, ..., d − 1}: u_i is a neighbor of u_{i+1}, and

- ∀i ∈ {1, 2, ..., d − 1}: at time t_i an agent moves from node u_i to node u_{i+1}
and reaches u_{i+1} at time t'_i.

Lemma 2. Let |U_t| > 2 at a given time t ≥ 0, and k ≥ 2.

1. Within finite time, at least two agents will leave the explored area E_t in
different directions.

2. A finite time after they have left E_t, say at t' > t, a causal chain is executed
from one of the two border nodes of E_{t'} to z_t.

Theorem 1. At least (n − 1) log(n − 1) + O(n) moves are needed to find a black
hole in a ring, regardless of the number of agents.

3 Co-located Distinct Agents

In this section we consider the case when all agents are co-located but distinct;
i.e., they start at the same node, called the home base, and have distinct
identities. The distinct labels of the agents allow any tie to be deterministically
broken. As a consequence, if the ring is unoriented, the agents can initially agree
on the directions of the ring. Thus, in the rest of this section, we assume w.l.o.g.
that the ring is oriented.

Let 0, 1, ..., n − 1 be the nodes of the ring in clockwise direction (0, −1, ...,
−(n − 1) in counter-clockwise direction) and, without loss of generality, let us
assume that node 0 is the home base.

3.1 Agent-Optimal Solution 

At least two agents are needed to locate the black hole (Fact 3). We now consider
the situation when there are exactly two agents, l and r, in the system, and they
are co-located.

The algorithm proceeds in phases. Let E_i and U_i denote the explored and
unexplored nodes in phase i, respectively. Clearly, E_i and U_i partition the ring
into two connected subgraphs, with the black hole located somewhere in U_i.

Algorithm 1 (Two Agents)

Start with round number i = 1, E_1 = {0}, and U_1 = {1, 2, ..., n − 1}.

1. Divide U_i into two contiguous disjoint parts U_i^l and U_i^r of almost equal
sizes. Since U_i is a path, this is always possible. (We may assume U_i^l is to the
left of 0 while U_i^r is to the right.)

2. Let agents l and r explore (using Cautious Walk) U_i^l and U_i^r, respectively.
Note that, since both of them are within E_i and since U_i is divided into two
contiguous parts, the agents can safely reach the parts they have to explore.

3. Since U_i^l and U_i^r are disjoint, at most one of them contains the black hole;
hence, one of the agents (w.l.o.g. assume r) successfully completes step 2.
Agent r then moves across E_i and follows the safe ports of U_i^l until it comes
to the node w from which there is no safe port leading to the left.

4. Denote by U_{i+1} the remaining unexplored area. (All nodes to the right of w,
up to the last node of U_i^r explored by r, are now explored; they form E_{i+1}.)
If |U_{i+1}| = 1, agent r knows that the black hole is in the single unexplored
node and terminates. Otherwise U_{i+1} is divided into U_{i+1}^l and U_{i+1}^r as in
step 1. Agent r leaves on the whiteboard of w a message for l indicating the
two areas U_{i+1}^l and U_{i+1}^r. Note that O(log n) bits are sufficient to code
this message.

5. Agent r traverses E_{i+1} and starts exploring U_{i+1}^r. (Proceeds to the next
round: increment i and go to step 2.)

6. When (if) l returns to w, it finds the message and starts exploring U_{i+1}^l.
(Proceeds to the next round: increment i and go to step 2.)

Theorem 2. Two agents can find the black hole performing 2n log n + O(n)
moves (in time 2n log n + O(n)).

From Fact 3 and Theorems 1 and 2 it follows that

Corollary 3. Algorithm 1 is size-optimal and cost-optimal.
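To make the halving argument behind the two-agent algorithm concrete, the
following sketch counts rounds in a pessimistic scenario and evaluates the
leading terms of the two move bounds. It is a hedged illustration, not the
paper's analysis: it assumes that in every round the larger of the two parts
remains entirely unexplored, that logarithms are base 2, and it ignores the
O(n) terms; the function names are hypothetical.

```python
import math


def worst_case_rounds(n):
    """Rounds until a single unexplored node remains, if each round leaves
    the larger of the two almost-equal parts unexplored (an assumption)."""
    size, rounds = n - 1, 0            # U_1 = {1, ..., n - 1}
    while size > 1:
        size = math.ceil(size / 2)     # the larger part stays unexplored
        rounds += 1
    return rounds


def move_bounds(n):
    """Leading terms of the lower and upper bounds on moves (O(n) terms dropped)."""
    lower = (n - 1) * math.log2(n - 1)     # lower bound on moves
    upper = 2 * n * math.log2(n)           # upper bound for two agents
    return lower, upper
```

For example, worst_case_rounds(9) shrinks the unexplored region through sizes
8, 4, 2, 1, and the ratio upper/lower returned by move_bounds approaches 2 as n
grows, matching the "within a factor of two" statement.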

3.2 More Than Two Agents: Improving the Time 

In this section we study the effects of having k > 2 agents in the home base. 

We know that increasing k will not result in a decrease of the total number
of moves; in fact, the lower bound of Theorem 1 is independent of the number
of agents and is already achieved, within a factor of two, by k = 2. However, the
availability of more agents can be exploited to improve the time complexity of
locating the black hole.

The following theorem shows a simple lower bound on the time needed to 
find the black hole, regardless of the number of agents in the system. 

Theorem 3. In the worst case, 2n — 4 time units are needed to find the black 
hole, regardless of the number of agents available. 

We now show that the lower bound can be achieved employing n − 1 agents.
Let r_1, ..., r_{n−1} be the n − 1 agents.

Algorithm 2 (n − 1 Agents)

Each agent r_i is assigned a location i + 1; its task is to verify whether that is
the location of the black hole. It does so in two steps, executed independently
of the other agents.

Step 1: It first goes to node i in clockwise direction and, if successful, returns to
the home base (phase 1).

Step 2: It then goes in counter-clockwise direction to node −(n − i − 2) and, if
successful, returns to the home base: the assigned location is where the black
hole resides.

Clearly, only one agent will be able to complete both steps, while the other n − 2
will be destroyed by the black hole.
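The two steps can be checked by brute force on a small ring. The sketch below is
an illustration under stated assumptions: nodes are numbered 0, ..., n − 1
clockwise with the home base at 0, the agent index i ranges over 0, ..., n − 2 so
that the assigned locations i + 1 cover every node except the home base, and
each move takes one time unit. The function name is hypothetical.

```python
def surviving_agents(n, black_hole):
    """Return (agent index, elapsed time) for every agent completing both steps.

    Agent i checks location i + 1 by approaching it from both sides.
    """
    survivors = []
    for i in range(n - 1):
        clockwise = range(1, i + 1)     # nodes visited on the way to node i
        counter = range(i + 2, n)       # nodes n-1 down to i+2, i.e. -(n-i-2)
        if black_hole in clockwise or black_hole in counter:
            continue                    # destroyed during step 1 or step 2
        time = 2 * i + 2 * (n - i - 2)  # out and back in each direction
        survivors.append((i, time))
    return survivors
```

Under these assumptions exactly one agent survives for each possible black-hole
location, and it finishes after 2n − 4 time units.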

Theorem 4. The black hole can be found in time 2n — 4 by n — 1 agents starting 
from the same node. 

Thus, by Theorems 3 and 4 it follows that

Corollary 4. Algorithm 2 is time-optimal.

We now show how to employ the idea used for the time-optimal algorithm to
obtain a trade-off between the number of agents employed and the time needed
to find the black hole. Let q (1 < q < log n) be the trade-off parameter.

Algorithm 3 (Time-Size Tradeoff)

Two agents (called explorers) are arbitrarily chosen; their task is to mark all
the safe ports before the black hole. They do so by leaving the home base in
opposite directions, moving cautiously; each will continue until it is destroyed
by the black hole.

The other agents start their algorithm in pipeline with the two explorers, always
leaving from safe ports. The algorithm proceeds in q rounds.

In each round, n^(1/q) − 1 agents (r_1, ..., r_{n^(1/q)−1}) follow an algorithm
similar to Algorithm 2 to reduce the size of the unexplored area by a factor of
n^(1/q). The unexplored area is in fact divided into segments S_1, S_2, ...,
S_{n^(1/q)} of almost equal size (e.g., at the first phase the segment S_i is
{(i − 1)n^(1−1/q) + 1, ..., i n^(1−1/q)}).

Agent r_i verifies the guess that the black hole belongs to segment S_i by checking
the nodes around S_i (first the right one, then the left one).

Clearly only one agent, say r_i, will be able to locate the segment containing the
black hole. When r_i verifies its guess, arriving at the node to the left of S_i, the
agents r_j with j < i are trying to enter S_i from the left, while the agents r_j for
j > i are still trying to enter S_i from the right. To use these agents in the next
round, r_i has to “wake them up”: before returning to the home base, r_i moves
left (possibly entering S_i) up to the last safe port, awakening all r_j with j < i,
so that they can correctly proceed to the next round.

The process is repeated until the black hole is located in round q.

Notice that, except for the two exploring agents, all agents survive. 

Theorem 5. Let 1 < q < log n. The black hole can be found using n^(1/q) + 1
agents in time 2(q + 1)n − o(n).
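The trade-off can be tabulated for concrete values. The sketch below is an
illustration only: it reads the agent count as n^(1/q) + 1 (the two explorers
plus n^(1/q) − 1 agents per round), rounds n^(1/q) up to an integer, and reports
only the leading term 2(q + 1)n of the time, dropping the o(n) correction. These
readings and the function names are assumptions.

```python
def int_root_ceil(n, q):
    """Smallest integer r with r**q >= n (integer arithmetic, no float error)."""
    r = 1
    while r ** q < n:
        r += 1
    return r


def tradeoff(n, q):
    """Return (number of agents, leading term of the time) for parameter q."""
    agents = int_root_ceil(n, q) + 1    # explorers plus per-round agents
    time = 2 * (q + 1) * n              # o(n) term ignored
    return agents, time
```

For n = 1024, increasing q from 2 to 5 lowers the agent count from 33 to 5 while
the time estimate doubles from 6n to 12n.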

4 Dispersed Anonymous Agents 

In this section we examine the case when the agents are anonymous but dispersed 
(i.e., initially there is at most one agent at any given location). The number k 
of agents is not known a priori. 

4.1 Basic Properties and Lower Bounds 

A simple but important property is that, although anonymous, the agents can
uniquely identify each other by means of purely local names. This is easily
achieved as follows. Each agent a will think of the nodes as numbered with
consecutive integers in the clockwise direction, with its starting node (its
“homebase”) as node 0. Then, when moving, agent a will keep track of the relative
distance d_a from the homebase: adding +1 when moving clockwise, and −1
otherwise. Thus, when a encounters at the node (at distance) d_a = −3 an agent b
which is at distance d_b = +2 from its own homebase, a is able to unambiguously
determine that b is the unique agent whose homebase is node −5 (in a’s view of
the ring).

Lemma 3. Each agent can distinguish and recognize all other agents.
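The local naming scheme amounts to a subtraction of relative distances. The
sketch below illustrates it; the class and function names are hypothetical, and
the comparison modulo n in same_agent is an added observation (two local names
that differ by a multiple of the ring size denote the same node), not a
statement taken from the paper.

```python
class Agent:
    """Tracks the signed distance from the agent's own homebase."""

    def __init__(self):
        self.d = 0

    def move(self, clockwise):
        self.d += 1 if clockwise else -1


def homebase_of_other(d_a, d_b):
    """Homebase of agent b in a's numbering, when a (at distance d_a) meets
    b (at distance d_b from b's own homebase) at the same node."""
    return d_a - d_b


def same_agent(name1, name2, n):
    """Two local names denote the same homebase iff they agree modulo n."""
    return (name1 - name2) % n == 0
```

With d_a = −3 and d_b = +2 this reproduces the example in the text:
homebase_of_other(-3, 2) evaluates to −5.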

Another simple but important property is that, unlike the case of co-located
agents, with dispersed agents there is a major difference between oriented and
unoriented rings. In fact, if the ring is unoriented, two agents no longer suffice
to solve the problem: they could be located in the nodes next to the black hole,
and both made to move towards it. In other words:

Fact 4. At least three dispersed agents are needed to locate the black hole in the
unoriented ring.

Thus, when dealing with the unoriented ring, we assume that there are at least 
three dispersed agents. 

We now establish a lower bound (that we will prove to be tight in the next 
section) on the cost for locating the black hole; the lower bound is established 
for oriented rings and, thus, applies also to the unoriented case. 

Theorem 6. The cost of locating the black hole in oriented rings is at least
Ω(n log n).

We now consider the case when every agent is endowed with a priori knowl- 
edge of k. This additional knowledge would provide little relief, as indicated by 
the following lower bound. 

Theorem 7. If k is known a priori to the agents, the cost of locating the black
hole in oriented rings is Ω(n log(n − k)).

The proof of Theorem 7 considers a worst-case scenario: an adversarial
placement of both the black hole and the agents in the ring. So, one last question
is whether, knowing k, we could fare substantially better under a (blind but)
favorable placement of the agents in the ring; i.e., assuming that k is known a
priori and that we can place the agents, leaving to the adversary only the
placement of the black hole. Also in this case, the answer is substantially
negative. In fact, the application of the proof technique of Theorem 1 (with the
initial explored region set to be the smallest connected region containing all
agents, which is clearly of size at most n − n/k) yields a lower bound of
Ω(n log(n/k)) = Ω(n(log n − log k)), which, for reasonably small k, is still
Ω(n log n).



4.2 Oriented Rings: A Cost-Optimal Algorithm 

In this section we describe a cost-optimal algorithm for the oriented ring where
k ≥ 2 anonymous agents are dispersed. The algorithm is composed of three
distinct parts: pairing, elimination, and resolution.

The basic idea is to first form pairs of agents and then have the pairs search
for the black hole.

Algorithm 4 (Pairing)

1. Move along the ring clockwise using cautious walk, marking (direction and
distance to the starting node) the visited nodes, until arriving at a node
visited by another agent.

2. Chase that agent until you come to a) a node visited by two other agents or
b) the last safe node marked by the agent you are chasing.

Case a) Terminate with status alone.

Case b) Form a pair: Leave a mark Join me and terminate with status
paired-left.

3. When, during your cautious walk, you encounter the mark Join me, clear
this mark and terminate with status paired-right.

4. If you meet a paired agent, terminate with status alone.

The agents with status paired-left or paired-right will then execute the algorithm
to locate the black hole. The agents terminating with status alone will be passive
in the remainder of the computation.

Lemma 4. At least one pair is formed during the pairing phase. The pairing 
phase lasts at most 3n — 6 time units, its cost is at most 4n — 7. 

Note that, if the pairing algorithm starts with k agents, any number of pairs
between 1 and ⌊k/2⌋ can be formed, depending on the timing. For example,
⌊k/2⌋ pairs are formed when the “even” (as counting to the left from the black
hole) agents are very slow, and the “odd” agents are fast and catch their right
neighbors.

Since agents can distinguish themselves using local names based on their
starting nodes (Lemma 3), the pairs can also be given local names, based on the
node where the pair was formed (the “homebase”). This allows a pair of agents
to ignore all other agents. Using this fact, a straightforward solution consists of
having each pair independently execute the location algorithm for two agents
(Algorithm 1). This however will yield an overall O(n² log n) worst-case cost.

To reduce the cost, the number of active pairs must be effectively reduced. 
The reduction is done in a process, called elimination, closely resembling leader 
election. In this process, the number of homebases (and thus pairs) is reduced 
to at most two. 

The two agents in the pair formed at node v will be denoted by r_v and l_v, and
referred to as the right and the left agent, respectively; v will be their homebase.



Algorithm 5 (Elimination)

The computation proceeds in logical rounds. In each round, the left agent l_v
cautiously moves to the left until it is destroyed by the black hole (case 0), or it
reaches a homebase u with higher (case 1) or equal (case 2) round number. In
case (1), l_v returns to v, which it marks Dead. In case (2), l_v marks u as Dead
and returns to v; if v is not marked Dead, it is promoted to the next round.

Similarly, agent r_v cautiously moves to the right until it finds (if it is not
destroyed by the black hole) the first homebase u in equal or higher round; it
then returns back to v. If the current level of v (its level could have risen during
the travel of r_v) is not higher than the level of u, v is marked Dead; otherwise,
if v is not marked Dead, r_v travels again to the right (it is now in a higher
round).

To prevent both agents of a pair from entering the black hole, both l_v and r_v
maintain a counter and travel to distance at most ⌊(n − 1)/2⌋. If one of them
has traveled such a distance without finding another homebase with the same
or higher round, it returns back to v, and v is marked Selected.

The rule of case 1 renders stronger a homebase (and, thus, a pair) in a higher 
logical round; ties are resolved giving priority to the right node (case 2 and the 
handling of the right agent). This approach will eventually produce either one 
or two Selected homebases. If an agent returns to a homebase marked Dead, 
it stops any further execution. When a homebase has been marked Selected,
the corresponding agents will then start the resolution part of the algorithm by
executing Algorithm 1 and locating the black hole. Note that, for each of the
two pairs, the execution is started by a single agent; the other agent either has 
been destroyed by the black hole or will join in the execution upon its return to 
the homebase. Summarizing, the overall algorithm is structured as follows. 

Algorithm 6 (Overall)

1. Form pairs of agents using Algorithm 4.

2. Reduce the number of pairs using Algorithm 5.

3. Find the location of the black hole using Algorithm 1.

Theorem 8. In oriented rings, the black hole can be found by k ≥ 2 dispersed
anonymous agents in O(n log n) time and cost.

Thus, by Theorems 6 and 8 it follows that

Corollary 5. Algorithm 6 is cost-optimal.

4.3 Oriented Rings: Considerations on Time 

In the previous section we have shown that the lower bound on the cost is tight,
and can be achieved by two agents. This implies that the presence of multiple
agents does not reduce the cost of locating the black hole. The natural question
is whether the presence of more agents can be successfully exploited to reduce
the time complexity.

Unlike the case of co-located agents, now the agents have to find each other
to be able to distribute the workload. Note that, if the agents are able to quickly
gather in a node, Algorithm 3 can be applied. As a consequence, in the remainder
of this section, we focus on the problem of quickly grouping the agents.

If the number of agents k is known, the gathering problem can be easily
solved by the following algorithm:

Algorithm 7 (Gathering - k known)

1. Each agent travels to the right using cautious walk.

2. When arriving at a node already visited by another agent, it proceeds to the
right via the safe port.

3. If there is no safe port, it tests how many agents are at this node; if the
number of agents at the node is k − 1, the algorithm terminates.

Eventually, since all agents travel to the right, all but one agent (which will 
reach the black hole) will be at the same node (in the worst case, the left neighbor 
of the black hole). Since, using cautious walk, it takes at most 3 time units to 
safely move to the right, and since there are at most n — 2 such possible moves, 
this yields the following lemma: 

Lemma 5. If the number of agents k is known, k − 1 agents can gather in an
oriented ring in time 3n − 6.
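The 3n − 6 figure can be reproduced with a short calculation. The sketch below
is an upper estimate under stated assumptions, not a simulation of the
algorithm: "right" is taken to mean increasing node index, every step to the
right is charged the full 3 time units of a cautious step, and the fact that one
agent is lost to the black hole is ignored. The function name is hypothetical.

```python
def worst_case_gather_times(n, black_hole, starts):
    """Upper estimate of the time each agent needs to reach the left
    neighbor of the black hole when every step costs 3 time units."""
    target = (black_hole - 1) % n      # gathering node in the worst case
    return {s: 3 * ((target - s) % n) for s in starts}
```

An agent starting at the right neighbor of the black hole has the longest trip,
n − 2 steps, which gives 3(n − 2) = 3n − 6.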

This strategy cannot be applied when k is unknown. In fact, while the agents
can follow the same algorithm as in the previous case, they have no means to
know when to terminate (and, thus, to switch to Algorithm 3).

Actually, if causal time complexity is considered (i.e., length of the longest 
chain of causally related moves, over all possible executions of the algorithm), 
the additional agents can be of little help in the worst case: 

Lemma 6. The causal time complexity of locating the black hole in an oriented
ring, using k agents, is at least n(log n − log k) − O(n).

However, if the bounded delay time complexity is considered (i.e., assuming 
a global clock and that each move takes at most one time unit), the additional 
agents can indeed help. Initially, all agents are in state alone. 

Algorithm 8 (Gathering - k unknown)

Rules for alone agent r:

1. Cautiously walk to the right until you meet another agent r'.

2. If r' is in state alone, form a group G (r and r' change state to grouped)
and start executing the group algorithm.

3. Otherwise (r' is in state grouped, belonging to the group G', formed at
the node g') join the group G': Go to g' and set your state to Join[g'].

Rules for group G formed at node g, consisting of |G| agents:

Execute Algorithm 3 using |G| agents, with the following actions taken after
finishing each phase and before starting the next one:

1. If any of your agents have seen agents of another group G' whose starting
node g' is to the right of g, join group G' by sending all your agents to
g', with state Join[g'].

2. Otherwise add all the agents waiting at g with state Join[g] to G and
execute the next phase of Algorithm 3 using the updated number of
agents.

Theorem 9. In oriented rings the black hole can be located by k = agents
in bounded delay time complexity 0{qn^^'^).

4.4 Unoriented Ring

If the ring is unoriented, at least three dispersed agents are needed to locate 
the black hole (Fact 4). Thus we assume that there are at least three dispersed 
agents. 

It is easy to convert a solution for oriented rings into one for the unoriented
ones, at the cost of twice the number of moves and of agents.

Lemma 7. Let A be an algorithm for oriented rings which, using p agents, solves
problem P in time T and cost C. Then there is an algorithm A' for unoriented
rings which, using 2p − 1 agents, solves P in time T and cost at most 2C.
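One way to see why 2p − 1 agents are a natural choice is a pigeonhole count:
whatever local sense of direction each agent has, at least p of them agree. The
proof of the lemma is not given here, so the sketch below is only a plausible
reading of the agent count, with hypothetical function names; it checks the
counting claim exhaustively for small p.

```python
from itertools import product


def majority_orientation_size(orientations):
    """Size of the largest set of agents sharing the same orientation.

    `orientations` is a sequence of 0/1 values, one per agent.
    """
    ones = sum(orientations)
    return max(ones, len(orientations) - ones)


def check_pigeonhole(p):
    """True if every split of 2p - 1 agents into two orientations
    contains at least p agents that agree."""
    return all(
        majority_orientation_size(o) >= p
        for o in product((0, 1), repeat=2 * p - 1)
    )
```

With only 2p − 2 agents the guarantee fails, since the agents may split evenly
into two groups of p − 1.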

Note that Lemma 7 can be applied to all previous algorithms presented for
dispersed agents, except Algorithm [?].

From Theorem 8, Lemma 7 and Fact 4, it follows that:

Theorem 10. Three (anonymous dispersed) agents are necessary and sufficient
to locate the black hole in an unoriented ring.

Abstract. We study the efficiency of randomised solutions to the mutual
search problem of finding k agents distributed over n nodes. For a
restricted class of so-called linear randomised mutual search algorithms
we derive a lower bound of + 1) expected calls in the worst case. 

A randomised algorithm in the shared-coins model matching this bound 
is also presented. Finally we show that in general more adaptive ran- 
domised mutual algorithms perform better than the lower bound for the 
restricted case, even when given only private coins. A lower bound for 
this case is also derived. 



1 Introduction 

Buhrman et al. [BFG+99] introduce the mutual search problem, where k agents
distributed over a complete network of n distinct nodes are required to learn 
each other’s location. Agents can do so by calling a single node at a time to 
determine whether it is occupied by another agent. The object is to find all other 
agents in as few calls as possible. Buhrman et al. study this problem extensively 
for synchronous and asynchronous networks, and both in the deterministic and 
randomised case, predominantly for k = 2 agents. 

The prime motivation for studying this problem is the cost of conspiracy 
start-up in secure multi-party computations, or Byzantine agreement problems. 
Traditionally, this area of research assumes that all adversaries have complete 
knowledge of who and where they are, and that the adversaries can immediately 
collude to break the algorithm. The question is how hard it is to achieve this 
coordination, and how much information the good nodes in the system learn 
about the location of the adversaries during this coordination phase. For details 
and more examples we refer to Buhrman et al. [BFG+99].

Lotker and Patt-Shamir focus on randomised solutions to the mutual
search problem in the special case of k = 2 agents. They prove a lower bound of
$\frac{n+1}{3}$ expected calls in the worst case for any synchronous randomised algorithm
for mutual search, and present an algorithm achieving this bound in the shared
coins model.

* Id: rnd-mutsearch.tex,v 1.15 2001/06/25 13:27:01 hoepman Exp 

J. Welch (Ed.): DISC 2001, LNCS 2180, pp. Isn- ITm 2001. 
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This paper is the first to consider the general case of k < n agents for 
randomised solutions. We generalise the results of Lotker and Patt-Shamir to a 
lower bound¹ of

$$\frac{k-1}{k+1}(n+1) = k - 1 + \frac{k-1}{k+1}(n-k)$$

expected calls in the worst case for a restricted class of linear synchronous
randomised algorithms (see Section 2 for an explanation of this term) that only
depend on the calling history in a limited fashion. We note here that all mutual
search algorithms for k = 2 agents are linear by definition. We also present an
algorithm achieving this bound, in the shared coins model.
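
The bound for linear algorithms is quoted in this paper in two forms. The following short check (a worked step added here, not part of the original text) shows that they agree, by splitting n + 1 into (n - k) + (k + 1):

```latex
\begin{align*}
\frac{k-1}{k+1}\,(n+1)
  &= \frac{k-1}{k+1}\,\bigl((n-k) + (k+1)\bigr) \\
  &= (k-1) + \frac{k-1}{k+1}\,(n-k).
\end{align*}
```

For k = 2 the left-hand side reduces to (n + 1)/3, so the first term counts the k - 1 unavoidable successful calls and the second the expected unsuccessful ones.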

Compared to the upper bound of Buhrman et al. [BFG+99] of

$$\frac{k(k-1)}{k(k-1)+1}\,\cdots$$

for the deterministic case, we see that our randomised algorithm can handle 
roughly the square of the number of agents at the same or less cost. 

Moreover, we show that more adaptive randomised mutual search algorithms 
outperform linear algorithms, even when the nodes are given access to a private 
coin only. Using shared coins we obtain a randomised algorithm whose worst 
case expected cost equals

$$k - 1 + \left(\frac{k-1}{k+1} - \frac{k-2}{n}\right)(n-k).$$



For the private coins model, and in the particular case where k = n - 1, we present
a randomised algorithm whose worst case expected cost equals $k - 1 + \frac{1}{2}$.
Finally, we derive a lower bound of

$$k - 1 + \frac{n-k}{k+1}$$

expected calls in the worst case for any randomised mutual search algorithm.

For k > 2, there are basically two choices on how to proceed when two agents 
find each other. They either just record each other’s location and proceed to find 
all other agents independently, or they ‘merge’ into a single node where a single 
master agent in the merger is responsible for making all calls. In the second case, 
a call reaching any agent in the merged set discovers all agents in the set. Our 
results hold in the latter, merging agents, model. Lower bounds in this model 
also hold for the independent agents model. 

The paper is organised as follows. In Section 2 we extend the notion of
a mutual search algorithm to k > 2 nodes. We then present our results on
linear algorithms in Section 3. This includes the upper bound in Section 3.1, a
discussion on linear algorithms in Section 3.2, and the lower bound in Section 3.3.
Next, we present the upper bounds for unrestricted mutual search algorithms
in Section 4, both in the shared coins model (Section 4.1) and the private coins
model (Section 4.2). Section 4.3 contains the lower bound for the unrestricted
case. We conclude with some directions for further research.

¹ The second formulation of the bound visually separates the minimal number of
successful calls k - 1 from the fraction of unsuccessful calls among the n - k unoccupied
nodes.



2 Preliminaries 

Let V be a set of n labelled nodes, some of which are occupied by an agent.
The set of nodes occupied by an agent is called an instance of the mutual search
problem, and is represented by a subset I of the nodes V. We use k = |I| for the
number of agents. Agents are anonymous, and are only identifiable by the node 
on which they reside. 

The nodes V lie on a completely connected graph, such that each agent can 
call every other node to determine whether it is occupied by another agent. If 
so, such a call is successful, otherwise it is unsuccessful. Nodes not occupied by 
agents are passive: they cannot make calls, and cannot store information to pass 
from one calling agent to the next. 

Initially each agent only knows the size and labelling of the graph, the total 
number of agents on the graph, and the label of the node it occupies. The mutual 
search problem requires all agents on the graph to learn each other’s location, 
using as few calls as possible (counting both successful and unsuccessful calls). 
A mutual search algorithm is a procedure that tells each agent which nodes to 
call, such that all agents are guaranteed to find each other. We only consider 
synchronous algorithms where in each time slot t a single agent makes a call. 
For each agent, the decision whether to make a call, and if so, which node to 
call, depends on the time slot t, the label of the node on which it resides, and 
on the result of the previous calls made and received. 

With each successful call, agents are assumed to exchange everything they 
know so far about the distribution of agents over the graph. We distinguish 
two models. In the independent agents model, each agent is responsible to learn 
the location of all other agents on its own (of course using all information ex- 
changed with a successful call). In the merging agents model, a successful call 
transfers the responsibility to find the remaining agents to the agent called (or
the agent responsible for making calls on this agent’s behalf).² In this model
the mutual search algorithm conceptually starts with k clans (or equivalence
classes [BFG+99]), each containing a single agent. Each successful call merges
two clans into a single clan with a single leader, until a single clan with all k
agents remains. Information flow within a clan is for free: all agents in a clan im- 
plicitly learn the location of newfound agents. As the latter model is potentially 
more efficient, we only consider the merging agents model. 
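
The merging agents model can be pictured as a union-find structure over the occupied nodes. The sketch below is only an illustration under that assumption: the class name `Clans` and its methods are hypothetical and do not come from the paper. Each successful call merges two clans and hands responsibility to the leader of the called clan, so k agents end up in one clan after exactly k - 1 successful calls.

```python
class Clans:
    """Union-find over occupied nodes; every set is a clan with one leader."""

    def __init__(self, agents):
        self.leader = {a: a for a in agents}
        self.successful_calls = 0

    def find(self, a):
        # Follow leader pointers to the clan leader (with path halving).
        while self.leader[a] != a:
            self.leader[a] = self.leader[self.leader[a]]
            a = self.leader[a]
        return a

    def call(self, caller, callee):
        """Record a successful call; the called clan's leader takes over."""
        a, b = self.find(caller), self.find(callee)
        if a == b:
            return False  # calls inside one clan are assumed never to be made
        self.leader[a] = b
        self.successful_calls += 1
        return True

    def count(self):
        return len({self.find(a) for a in self.leader})


clans = Clans([3, 7, 8, 12])   # k = 4 agents, identified by their nodes
clans.call(3, 7)
clans.call(8, 12)
clans.call(7, 12)
print(clans.count(), clans.successful_calls)
```

Unsuccessful calls are deliberately left out of this sketch; they cost a call but do not change the clan structure.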

We note that there is a trivial lower bound of k - 1 calls for any mutual
search algorithm for k agents over n nodes, provided that k ≥ 2 and k < n. If
the algorithm makes fewer than k - 1 calls, at least one of the agents did not make
or receive a call and therefore cannot know the location of the other agents. For
n = k (or k = 1) the cost drops to 0, because agents know n and k.

² This is similar to the merging fragments approach for constructing minimum weight
spanning trees used by Gallager et al.



2.1 Algorithm Representations 

For k = 2, a deterministic synchronous mutual search algorithm can be
represented by a list of directed edges E. Edge $\overrightarrow{vw}$ is at position t in the list if an
agent on node v needs to query node w at time slot t. The order in the list
is represented by <. We will call this list the call list. The actual calls made
will depend on the distribution of the agents over the nodes. In this setting,
A = (V, E, <) completely describes a synchronous deterministic mutual search
algorithm for k = 2 agents. In fact, because the algorithm needs to make sure
that any pair of agents can find each other, the graph (V, E) is a tournament.
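
To make the call-list representation for k = 2 concrete, here is a minimal sketch. The function names `is_tournament` and `cost` and the sample call list are assumptions chosen for illustration; they are not taken from the paper. The position of an edge in the list plays the role of its time slot, and only occupied nodes actually place their calls.

```python
from itertools import combinations


def is_tournament(nodes, call_list):
    """True if every unordered pair of nodes has exactly one directed edge."""
    seen = set()
    for v, w in call_list:
        pair = frozenset((v, w))
        if v == w or pair in seen:
            return False
        seen.add(pair)
    return seen == {frozenset(p) for p in combinations(nodes, 2)}


def cost(call_list, instance):
    """Number of calls made when the two agents occupy `instance`."""
    calls = 0
    for v, w in call_list:        # list order stands for the order <
        if v in instance:         # passive (unoccupied) nodes make no calls
            calls += 1
            if w in instance:     # successful call: the agents have met
                return calls
    raise ValueError("this pair of agents never meets")


nodes = [0, 1, 2, 3]
call_list = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]

print(is_tournament(nodes, call_list))
worst = max(cost(call_list, set(p)) for p in combinations(nodes, 2))
print(worst)
```

If an edge were missing from the list, `cost` would raise for the corresponding pair, which is the reason the graph (V, E) has to be a tournament.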

For k > 2 the picture is much less clear. As an agent learns the location of 
some of the other agents, it may adapt itself by changing the order of future 
calls, or dropping certain calls altogether. One way of describing such a mutual 
search algorithm would be to divide each run of the algorithm on a particular
instance I into rounds. Whenever the algorithm makes a successful call (finding
another agent or clan and merging the clans), a new round starts. The algorithm
starts in round 1 to find the first pair of agents, and stops after the (k - 1)-th round
when all k agents are merged into one.³ Round boundaries represent knowledge
changes. At such a boundary, the algorithm has to decide on a new strategy to 
find the next agent, and has to stick to this strategy until it finds it (or gets 
found). This strategy change cannot be global: only merged nodes can change 
their strategy, because they are the only nodes that are aware of the merge. Also, 
this strategy change can only depend on the agents found so far. 

For a restricted class of so called linear algorithms we will derive matching 
upper and lower bounds. The class of linear mutual search algorithms defined 
below includes algorithms which are only marginally adaptive, but also forward 
calling algorithms that merge agents one by one into one single growing clan. 

Definition 2.1. A mutual search algorithm A to find k agents is linear, if for
all instances I of size k + 1

- A makes at least k - 1 successful calls⁴ when started on instance I, and

- for the first k - 1 successful calls $\overrightarrow{vw}$ made by A when started on instance
I, A still makes a (now unsuccessful) call to w when started on instance
I - {w}.

We note that any mutual search algorithm for k = 2 agents is always linear.

³ We assume that no algorithm makes calls between agents in the same clan (that
already know each other’s location). Therefore the graph of successful calls is acyclic,
and spans all k agents when k - 1 successful calls have been made.

⁴ It is not immediately obvious that an algorithm to find k agents when started on an
instance of k' > k agents will always run until it has made k - 1 successful calls.
This topic is discussed in Sect. 3.2.
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2.2 Properties of Binomials 

The following properties of binomials are used in some proofs in this paper. 

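Property 2.2. We have

$$c\binom{c-1}{k-2} = (k-1)\binom{c}{k-1}. \tag{1}$$

From Feller p. 64, we get

$$\sum_{\nu=0}^{r}\binom{\nu+k-1}{k-1} = \binom{r+k}{k}. \tag{2}$$

Using this equation we derive

$$\sum_{c=k-1}^{n-1}\binom{c}{k-1} = \binom{n}{k}. \tag{3}$$

Combining Equation (1) and (3) we conclude

$$\sum_{c=k-1}^{n-1} c\binom{c-1}{k-2} = (k-1)\binom{n}{k}. \tag{4}$$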
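
The four binomial properties can be checked numerically. The sketch below assumes the following reading of Equations (1) to (4): (1) c·C(c-1, k-2) = (k-1)·C(c, k-1); (2) the sum of C(v+k-1, k-1) for v = 0..r equals C(r+k, k); (3) the sum of C(c, k-1) for c = k-1..n-1 equals C(n, k); (4) the sum of c·C(c-1, k-2) over the same range equals (k-1)·C(n, k). The helper name `check` is hypothetical.

```python
from math import comb


def check(n, k):
    """Verify identities (1)-(4), as read above, for one pair (n, k), 2 <= k <= n."""
    cs = range(k - 1, n)                      # c = k-1, ..., n-1
    eq1 = all(c * comb(c - 1, k - 2) == (k - 1) * comb(c, k - 1) for c in cs)
    r = n - k                                 # substitution used to get (3) from (2)
    eq2 = sum(comb(v + k - 1, k - 1) for v in range(r + 1)) == comb(r + k, k)
    eq3 = sum(comb(c, k - 1) for c in cs) == comb(n, k)
    eq4 = sum(c * comb(c - 1, k - 2) for c in cs) == (k - 1) * comb(n, k)
    return eq1 and eq2 and eq3 and eq4


print(all(check(n, k) for n in range(2, 30) for k in range(2, n + 1)))
```

Identity (3) is identity (2) with the index shifted by v = c - k + 1 and r = n - k, which is why the code reuses `r = n - k`.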



3 Bounds on Linear Algorithms 

In this section we prove a lower bound of 

$$\frac{k-1}{k+1}(n+1)$$

on the worst case expected number of calls made by any randomised mutual 
search algorithm to find k agents among n > k nodes, and match this with 
a randomised algorithm (in the shared coins model) achieving this bound. We 
present the algorithm first, and then prove the lower bound. 



3.1 Upper Bound 

The upper bound is proven using the following straightforward algorithm. 

Algorithm 3.1. First all nodes relabel the nodes according to a shared
random permutation of {0, ..., n - 1}. The algorithm then proceeds in synchronous
pulses (starting with pulse 0). An agent on node i (after relabelling) is quiet up to
pulse i.

- If it does not receive a call at pulse i - 1 (or if i = 0), in pulses i, i + 1, ...
it calls nodes i + 1, i + 2, ... until it has contacted all k - 1 other agents.

- If a node i receives a call at pulse i - 1, it remains quiet for all future pulses.
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Using this algorithm the agent on the first node (after relabelling) of the instance 
will call all other nodes. All other agents will remain silent. We note that the 
algorithm is linear. 
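
The behaviour of Algorithm 3.1 can be checked by brute force on small rings of labels. The sketch below is an illustration, not the authors' code: it assumes that an active agent on node i calls node p + 1 at pulse p ≥ i, and it models the shared random relabelling by averaging over all k-subsets of {0, ..., n - 1}. The names `simulate_algorithm_3_1`, `expected_cost` and `closed_form` are hypothetical; the closed form compared against is the bound (k-1)/(k+1)·(n+1) quoted in the abstract.

```python
from fractions import Fraction
from itertools import combinations


def simulate_algorithm_3_1(instance, n):
    """Count the calls made when the relabelled agents occupy `instance`."""
    agents = set(instance)
    k = len(agents)
    received = {}                      # node -> pulses at which it was called
    found = {i: 0 for i in agents}     # successful calls made by each agent
    calls = 0
    for pulse in range(n - 1):
        target = pulse + 1             # node called by an active agent this pulse
        for i in sorted(agents):
            if i > pulse:
                break                  # still quiet: its own pulse has not come
            if (i - 1) in received.get(i, set()):
                continue               # was called at pulse i-1: quiet forever
            if found[i] == k - 1:
                continue               # has already contacted everyone
            calls += 1
            if target in agents:
                found[i] += 1
                received.setdefault(target, set()).add(pulse)
    return calls


def expected_cost(n, k):
    """Average over all k-subsets, i.e. over a uniformly random relabelling."""
    subsets = list(combinations(range(n), k))
    total = sum(simulate_algorithm_3_1(s, n) for s in subsets)
    return Fraction(total, len(subsets))


def closed_form(n, k):
    return Fraction((k - 1) * (n + 1), k + 1)


print(expected_cost(6, 3), closed_form(6, 3))
```

In every run only the agent on the first occupied node makes calls, so the cost is the distance between the first and the last occupied node.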

Theorem 3.2. Algorithm 3.1 has worst case expected cost

$$\frac{k-1}{k+1}(n+1).$$

Proof. Let I be a distribution of k agents over the n nodes. By performing the
relabelling, the algorithm R in effect performs input-randomisation and selects
a random instance I', after which it continues deterministically, performing c(I')
calls. Note that in pulse i, R calls i + 1 if and only if there is an agent at some
j < i + 1 and not all agents have been found yet. We conclude that c(I') equals
the distance between the first and last node in I'.
The expected cost of any instance I is given by

$$\sum_{|I'|=k} c(I')\,\Pr[I = I'].$$

Splitting into instances with equal cost, this is equal to

$$\binom{n}{k}^{-1}\sum_{c=k-1}^{n-1} c \cdot \bigl(\text{number of instances } I' \text{ with } |I'| = k \text{ and cost } c\bigr). \tag{5}$$

The minimal cost is k - 1, the maximal cost is n - 1. For a given cost c, the
range 0, ..., n - c - 1 of n - c nodes is a viable location for the first agent in
the instance. Let f denote the position of the first agent. The last agent then
resides at f + c (or else the cost would not equal c). The remaining k - 2 agents
can be distributed over the range f + 1, ..., f + c - 1 of c - 1 nodes. This shows
that Equation (5) equals

$$\binom{n}{k}^{-1}\sum_{c=k-1}^{n-1} c\,(n-c)\binom{c-1}{k-2}$$

which equals

$$\binom{n}{k}^{-1}\left(n\sum_{c=k-1}^{n-1} c\binom{c-1}{k-2} - \sum_{c=k-1}^{n-1} c^2\binom{c-1}{k-2}\right)$$

{Using Equation (4) and (1).}

$$= \binom{n}{k}^{-1}\left(n(k-1)\binom{n}{k} - (k-1)\sum_{c=k-1}^{n-1} c\binom{c}{k-1}\right)$$

{Using Equation (3).}

$$= (k-1)\binom{n}{k}^{-1}\left((n+1)\binom{n}{k} - \sum_{c=k-1}^{n-1} (c+1)\binom{c}{k-1}\right)$$

{Using Equation (1) with c + 1 and k + 1 in place of c and k.}

$$= (k-1)\binom{n}{k}^{-1}\left((n+1)\binom{n}{k} - k\sum_{c=k-1}^{n-1}\binom{c+1}{k}\right)$$

{Using Equation (3) with n + 1 and k + 1 in place of n and k.}

$$= (k-1)\binom{n}{k}^{-1}\left((n+1)\binom{n}{k} - k\binom{n+1}{k+1}\right)$$

$$= (k-1)\left((n+1) - \frac{k}{k+1}(n+1)\right)$$

$$= \frac{k-1}{k+1}(n+1).$$

This completes the proof.

3.2 Linear Algorithms 

In this section we give a characterisation of the class of linear mutual search
algorithms as defined by Definition 2.1. This includes two important classes of
mutual search algorithms defined next.

Definition 3.3. A mutual search algorithm A is non-adaptive, if for all
instances I and for each successful call $\overrightarrow{vw}$ between clans C and C' at the end of
round i the following holds.

- The call list for round i + 1 of the merged clan C ∪ C' equals the union of
the call lists for round i of the clans C and C', restricted to the calls ordered
after the successful call $\overrightarrow{vw}$.

- In the resulting call list, for each call the caller is replaced with the new leader
of the clan C ∪ C'.

- From this list all calls to nodes already called by one of the clans are removed,
as well as duplicate calls to the same destination, in which case the first call
(according to <) is retained.

Definition 3.4. A mutual search algorithm A is forward-calling, if for all in- 
stances I and for all successful calls mh,x^ made by A on instance I 

w = X implies ^ . 

We have the following characterisation of linear mutual search algorithms. 

Lemma 3.5. Both non-adaptive and forward calling mutual search algorithms 
are linear. 

Proof. Observe that for a forward calling algorithm started on an instance I, 
each successful call of a clan reaches a clan containing a single member agent. 
Therefore, a forward calling algorithm grows a single clan that contains i agents 
during round i. If I contains k + 1 agents, A stops at round k when the clan
contains k agents. So A makes k - 1 successful calls.
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If A calls w in the course of the algorithm, w is the single member of its own 
clan. Hence w made no successful calls and was not contacted by another agent 
up till now. Hence removing the agent from w does not affect the strategy of A 
up to this point, and hence A will call w. 

Observe that for a non-adaptive algorithm, a clan C calls a node w iff there
is a node v in the clan containing an agent for which $\overrightarrow{vw}$ is on its initial call list
(at the very start of the algorithm). If we remove the agent from v, all other
agents in C will call only a subset of the nodes called by C originally, and will
do so no sooner than C would have done.

Hence if for an instance I of size k + 1 two clans C, C' merge at some time t
by the call $\overrightarrow{vw}$, removing the agent from node w in C' will not result in any calls
to C from the remaining agents in C' before time t. Hence C cannot distinguish
these different instances, and will make the same calls (including the call to w).

This also shows that any clan of size less than k that occurs while running
the algorithm on an instance I of size k + 1 also occurs on an instance I' of size
k where an agent is removed from a node not in the clan. Hence, a non-adaptive
algorithm cannot distinguish between instances of size k and k + 1 until all
agents have been found. This proves that a non-adaptive algorithm will make at
least k - 1 calls on an instance of size k + 1. □

3.3 Lower Bound 

We now proceed to prove a lower bound on linear mutual search algorithms. In
the proof, we use the following result of Yao [Yao77].

Theorem 3.6 ([Yao77]). Let t be the expected running time of a randomized
algorithm solving problem P over all possible inputs, where the expected time
is taken over the random choices made by the algorithm. Let t' be the average
running time over the distribution of all inputs, minimized over all deterministic
algorithms solving the same problem P. Then t ≥ t'.

Therefore, we first focus our attention on the behaviour of deterministic
algorithms on random instances.

Fix n and k < n. Let A be a linear deterministic mutual search algorithm
with |V| = n to locate k agents.

Definition 3.7. Let I be an instance, and let v, w ∈ V. Define

$$\varphi_A(\overrightarrow{vw}, I) = 1$$

if and only if $\overrightarrow{vw}$ is an unsuccessful call (i.e. v ∈ I, w ∉ I) made by algorithm A
to locate all agents in the instance I.

Then for the total cost $C_A(I)$ of algorithm A on instance I we have

$$C_A(I) = k - 1 + \sum_{\substack{v \in I \\ w \in V - I}} \varphi_A(\overrightarrow{vw}, I), \tag{6}$$

where the k - 1 term is contributed by all successful calls made.
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Lemma 3.8. Let J be an arbitrary set ofk+1 < n nodes. Let A be linear. Then 

$$\sum_{\substack{v, w \in J \\ v \neq w}} \varphi_A(\overrightarrow{vw}, J - \{w\}) \geq k - 1. \tag{7}$$

Proof. Run A on instance J. By Definition 2.1, A will have made at least k - 1
successful calls. Take the first k - 1 successful calls. By Def. 2.1, for each of the
first k - 1 successful calls $\overrightarrow{vw}$, removing w does not affect the fact that v calls
it. Hence $\varphi_A(\overrightarrow{vw}, J - \{w\}) = 1$. Hence the sum is at least k - 1. □

This lemma allows us to give a bound on the expected cost of a random instance 
for a given algorithm A. 

Lemma 3.9. Let A be a linear deterministic mutual search algorithm for k
agents over n > k nodes. Then the expected cost $E[C_A(I)]$ of a random instance
X = I of k agents is bounded by $E[C_A(I)] \geq \frac{k-1}{k+1}(n+1)$.

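Proof.

$$E[C_A(X)] = \sum_{\substack{I \subseteq V \\ |I| = k}} C_A(I)\,\Pr[X = I] = \binom{n}{k}^{-1} \sum_{\substack{I \subseteq V \\ |I| = k}} C_A(I)$$

{By Equation (6).}

$$= \binom{n}{k}^{-1} \sum_{\substack{I \subseteq V \\ |I| = k}} \Bigl( k - 1 + \sum_{\substack{v \in I \\ w \in V - I}} \varphi_A(\overrightarrow{vw}, I) \Bigr)$$

$$= k - 1 + \binom{n}{k}^{-1} \sum_{\substack{I \subseteq V \\ |I| = k}} \sum_{\substack{v \in I \\ w \in V - I}} \varphi_A(\overrightarrow{vw}, I)$$

{Rearranging sums, setting J = I + {w}.}

$$= k - 1 + \binom{n}{k}^{-1} \sum_{\substack{J \subseteq V \\ |J| = k+1}} \sum_{\substack{v, w \in J \\ v \neq w}} \varphi_A(\overrightarrow{vw}, J - \{w\})$$

{By Lemma 3.8.}

$$\geq k - 1 + \binom{n}{k}^{-1}\binom{n}{k+1}(k-1)$$

$$= k - 1 + \frac{n-k}{k+1}(k-1) = \frac{k-1}{k+1}(n+1).$$

□

Using Theorem 3.6 we now derive the following result.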
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Theorem 3.10. The maximal expected cost of any randomised linear mutual 
search algorithm over n nodes on some instance of size k is at least

$$\frac{k-1}{k+1}(n+1).$$

Proof. We first compute the expected cost of a randomised linear algorithm R on
a random instance of size k of the mutual search problem over n nodes. Because
the random coin flips used by R are independent of the choice of the instance
we can first condition on the contents of the random tape. This fixes R, in effect
making it a deterministic linear algorithm A. The expected cost over a random
instance is now given by Lemma 3.9 to be at least

$$E[C_A(I)] \geq \frac{k-1}{k+1}(n+1).$$

As this is the expected cost of R on a random instance, there must be an instance
for which the expected cost is at least this high. □



4 Unrestricted Algorithms

In this section we investigate the power of more adaptive algorithms whose call
patterns depend heavily on the presence or absence of earlier successful calls.
We show that these algorithms perform better than the lower bound for linear
algorithms, given either a shared coin or a private coin.

4.1 Shared Coins

First we restrict our attention to algorithms deploying a shared random coin.
This shared coin can be used to perform a global relabelling of nodes, as
explained in Section 3.1.

The following algorithm uses

$$k - 1 + \left(\frac{k-1}{k+1} - \frac{k-2}{n}\right)(n-k)$$

expected calls in the worst case. In this algorithm, all agents call node 0, unless
node 0 does not contain an agent, in which case the first agent that discovers
this calls all other nodes until all remaining agents are found (similar to
Algorithm 3.1).

Algorithm 4.1. Globally relabel the nodes using the shared random coin. Agents
call other nodes depending on the pulse number (starting with pulse 0) as follows.

- An agent on node 0 (after relabelling) does not make any calls.

- An agent on node i > 0 (after relabelling) is quiet up to pulse 2i - 1.

  • If it did not receive any calls at earlier pulses, on pulse 2i - 1 it calls
    node 0.

    * If this call is successful, it makes no more calls.

    * Otherwise, at pulse 2i, 2i + 1, ... it calls nodes i + 1, i + 2, ... until
      it has contacted all remaining k - 1 agents.

  • If an agent on node i > 0 receives a call before pulse 2i - 1, it remains
    quiet for all future pulses.



Theorem 4.2. Algorithm 4.1 is a mutual search algorithm for k agents on n
nodes, making at most $k - 1 + \left(\frac{k-1}{k+1} - \frac{k-2}{n}\right)(n-k)$ expected calls in the worst
case.



Proof. We split the proof into two cases. 

an agent at node 0: This case occurs with probability k/n. Here, all
remaining k - 1 agents call node 0, so that k - 1 calls are made in total, and all
agents are found.

no agent at node 0: This case occurs with probability (n - k)/n. The first
agent at node i > 0 that discovers this case in pulse 2i - 1 will call all
remaining agents at nodes j > i. In fact, in this case the k agents run
Algorithm 3.1 on n - 1 nodes. Hence the worst case expected cost is given
by Theorem 3.2 as $\frac{k-1}{k+1}\,n$. We have to add 1 for the extra call to node 0 to
arrive at the total cost in this case.

The worst case expected cost is therefore given by

$$\frac{k}{n}(k-1) + \frac{n-k}{n}\left(1 + \frac{k-1}{k+1}\,n\right) = k - 1 + \left(\frac{k-1}{k+1} - \frac{k-2}{n}\right)(n-k).$$

This completes the proof. □

Note that the bound of Sect. 3 can be rewritten to $k - 1 + \frac{k-1}{k+1}(n-k)$.
Algorithm 4.1 improves this bound by $\frac{k-2}{n}(n-k)$.
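
The expected cost of Algorithm 4.1 can also be checked by exhaustive simulation on small instances. This is a sketch under stated assumptions, not the authors' code: the pulse schedule follows the algorithm as read here (probe node 0 at pulse 2i - 1, then sweep nodes i + 1, i + 2, ... from pulse 2i), the shared relabelling is modelled by averaging over all k-subsets, and the names `simulate_algorithm_4_1`, `expected_cost_4_1` and `closed_form_4_1` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def simulate_algorithm_4_1(instance, n):
    """Count the calls made when the relabelled agents occupy `instance`."""
    agents = set(instance)
    k = len(agents)
    called_at = {}    # node -> earliest pulse at which it received a call
    sweeping = {}     # agent whose probe of node 0 failed -> agents found so far
    calls = 0
    for pulse in range(1, 2 * n):
        for i in sorted(agents - {0}):
            if pulse < 2 * i - 1:
                continue                      # quiet up to pulse 2i-1
            if i in called_at and called_at[i] < 2 * i - 1:
                continue                      # called earlier: quiet forever
            if pulse == 2 * i - 1:
                calls += 1                    # probe node 0
                if 0 not in agents:
                    sweeping[i] = 0           # probe failed: start sweeping
            elif i in sweeping and sweeping[i] < k - 1:
                target = i + 1 + (pulse - 2 * i)
                calls += 1
                if target in agents:
                    sweeping[i] += 1
                    called_at.setdefault(target, pulse)
    return calls


def expected_cost_4_1(n, k):
    subsets = list(combinations(range(n), k))
    total = sum(simulate_algorithm_4_1(s, n) for s in subsets)
    return Fraction(total, len(subsets))


def closed_form_4_1(n, k):
    return (k - 1) + (Fraction(k - 1, k + 1) - Fraction(k - 2, n)) * (n - k)


print(expected_cost_4_1(5, 3), closed_form_4_1(5, 3))
```

An agent on node j > i is reached by the sweep at pulse i + j - 1, which is earlier than its own probe pulse 2j - 1, so at most one agent ever sweeps.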



4.2 Private Coins 

Using private coins the agents have only limited possibilities to counter bad 
node assignments made by the adversary. Global relabelling is not possible, for 
instance, because the agents cannot exchange the outcome of their private coin 
tosses. 

We consider the special case of k = n - 1, where all nodes except one are
occupied by an agent. Here, the following algorithm uses

$$k - 1 + \frac{1}{2}$$

expected calls in the worst case.

Algorithm 4.3. This algorithm runs as follows.

round 1: If there is an agent at node 0, it either calls node 1 or node 2, each
with probability 1/2.

round 2: If the previous call was unsuccessful, the other node is called.

round 3: If there is an agent at node 2, which did not receive a call from node
0, it calls node 1.

round 4: If there is an agent at node 1, which did not receive any calls (either
from node 0 or node 2), it calls node 2.

round 2 + i, 3 ≤ i ≤ n - 1: If there is an agent at node 1, and node 1 did not
receive a call from node 0 but was called by node 2, it calls node i.

round n - 1 + j, 3 ≤ j ≤ n - 1: If there is an agent at node j, and this node
did not receive a call from node 1, then it calls node 0.

Theorem 4.4. Algorithm 4.3 is a mutual search algorithm for k = n - 1 agents
on n nodes, making $k - 1 + \frac{1}{2}$ expected calls in the worst case.

Proof. By case analysis, using the fact that all but one node contain an agent. 

no agent at node i > 2: The agent at node 0 either calls node 1 (and then
node 2 calls node 1 at round 3) or node 2 (and then node 1 calls node 2 at
round 4). No calls are made at round 2 + i for 3 ≤ i ≤ n - 1, and all agents
on nodes 3 ≤ j ≤ n - 1 call node 0 at round n - 1 + j. One of these nodes
is not occupied by an agent. In this case, k - 1 calls are made in total.

no agent at node 0: The agent at node 2 calls node 1 at round 3. Hence the
agent at node 1 calls the agents at node i for 3 ≤ i ≤ n - 1 at round 2 + i.
Again, k - 1 calls are made in total.

no agent at node 1: With probability 1/2, the agent at node 0 calls node 2. No
other calls are made in round 2, ..., n + 1. All agents on nodes 3 ≤ j ≤ n - 1
call node 0 at round n - 1 + j. Again k - 1 calls are made in total.

With probability 1/2, the agent at node 0 calls node 1. Because this call
is unsuccessful, it also calls node 2 at round 2. No other calls are made in
round 3, ..., n + 1. All agents on nodes 3 ≤ j ≤ n - 1 call node 0 at round
n - 1 + j. Now k calls are made in total.

no agent at node 2: Similar to the previous case.

The worst case occurs if there is no agent at node 1 or node 2. In either case the
expected number of calls made equals $\frac{1}{2}k + \frac{1}{2}(k-1) = k - 1 + \frac{1}{2}$. This completes
the proof. □
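
The case analysis can be replayed mechanically. The sketch below is an illustration with hypothetical names (`calls_algorithm_4_3`, `worst_case_expected`). It assumes that in rounds 2 + i the agent at node 1 calls only if it was not called by node 0 and was called by node 2, which is the reading under which the call counts in the proof come out as stated.

```python
from fractions import Fraction


def calls_algorithm_4_3(n, empty, coin):
    """Total calls when node `empty` is unoccupied and agent 0 first tries `coin`."""
    agents = set(range(n)) - {empty}
    received = {v: set() for v in range(n)}   # node -> set of callers
    calls = 0

    def call(v, w):
        nonlocal calls
        calls += 1
        if w in agents:
            received[w].add(v)
            return True
        return False

    # rounds 1 and 2: agent 0 tries node `coin` (1 or 2), then the other one
    if 0 in agents and not call(0, coin):
        call(0, 3 - coin)
    # round 3: agent 2 calls node 1 unless it heard from node 0
    if 2 in agents and 0 not in received[2]:
        call(2, 1)
    # round 4: agent 1 calls node 2 if it received no call at all
    if 1 in agents and not received[1]:
        call(1, 2)
    # rounds 2+i, 3 <= i <= n-1: agent 1 sweeps only in the "node 0 empty" case
    for i in range(3, n):
        if 1 in agents and 0 not in received[1] and 2 in received[1]:
            call(1, i)
    # rounds n-1+j, 3 <= j <= n-1: agents not reached by node 1 call node 0
    for j in range(3, n):
        if j in agents and 1 not in received[j]:
            call(j, 0)
    return calls


def worst_case_expected(n):
    """Worst case over the empty node of the expectation over the private coin."""
    return max(
        Fraction(sum(calls_algorithm_4_3(n, empty, coin) for coin in (1, 2)), 2)
        for empty in range(n)
    )


print(worst_case_expected(6))
```

For n = 6, so k = 5, the adversary's best choice is to leave node 1 or node 2 empty, which costs k - 1 or k calls depending on the coin.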



4.3 Lower Bound



Using the results of Section 3 we now derive a lower bound of

$$k - 1 + \frac{n-k}{k+1}$$

expected calls in the worst case for arbitrary randomised mutual search
algorithms. We use the same notational conventions and definitions as used in that
section.
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Lemma 4.5. Let J be an arbitrary set of k + 1 < n nodes. Let A be an arbitrary 
deterministic mutual search algorithm. Then

$$\sum_{\substack{v, w \in J \\ v \neq w}} \varphi_A(\overrightarrow{vw}, J - \{w\}) \geq 1. \tag{8}$$



Proof. Run A on instance J. Because A is a mutual search algorithm, it will
make at least one successful call. Let $\overrightarrow{vw}$ be the first successful call. Now remove
w from J, and run A on J - {w}. As discussed in Section 2, the call pattern of a
node depends on its initial state and all calls made so far. For u ≠ w, the initial
state does not change. Moreover, any calls before the call from v to w must, by
assumption, be unsuccessful. Removing w does not influence the result of any of
these unsuccessful calls (they do not involve w). Removing w only removes any
unsuccessful calls made by w from the call pattern. They also have no effect on
the call patterns of the other nodes. Hence all u ≠ w make the same calls as
before up to the call to w. Hence v calls w, which is now unsuccessful. □



Theorem 4.6. The maximal expected cost of any randomised mutual search
algorithm over n nodes on some instance of size k is at least

$$k - 1 + \frac{n-k}{k+1}.$$

Proof. Following the steps of the proof of Lemma 3.9, with Lemma 4.5 in place
of Lemma 3.8, we get

$$E[C_A(X)] = k - 1 + \binom{n}{k}^{-1} \sum_{\substack{J \subseteq V \\ |J| = k+1}} \sum_{\substack{v, w \in J \\ v \neq w}} \varphi_A(\overrightarrow{vw}, J - \{w\})$$

{By Lemma 4.5.}

$$\geq k - 1 + \binom{n}{k}^{-1}\binom{n}{k+1} = k - 1 + \frac{n-k}{k+1}.$$

The theorem follows from this fact by the same reasoning as used in the proof
of Theorem 3.10. □
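
To get a feel for the remaining gap, the three bounds of this paper can be tabulated side by side. The sketch below only evaluates the closed forms as read from Theorems 3.10, 4.2 and 4.6; the function names are hypothetical.

```python
from fractions import Fraction


def linear_bound(n, k):
    """Tight bound for linear algorithms (Theorems 3.2 and 3.10)."""
    return Fraction((k - 1) * (n + 1), k + 1)


def shared_coin_upper(n, k):
    """Upper bound achieved with shared coins (Theorem 4.2)."""
    return (k - 1) + (Fraction(k - 1, k + 1) - Fraction(k - 2, n)) * (n - k)


def general_lower(n, k):
    """Lower bound for arbitrary randomised algorithms (Theorem 4.6)."""
    return (k - 1) + Fraction(n - k, k + 1)


n, k = 20, 5
print(general_lower(n, k), shared_coin_upper(n, k), linear_bound(n, k))
```

For k = 2 the three values coincide at (n + 1)/3; for larger k the lower bound of Theorem 4.6 and the upper bound of Theorem 4.2 separate, which is the gap mentioned under further research.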



5 Further Research 

We have investigated the mutual search problem in the randomised case with 
k > 2 agents. The main questions remaining are the following. 

First of all, we would like to derive efficient adaptive algorithms for an arbi- 
trary number of agents in the private coins model. Moreover, we would like to 
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close the gap between the lower bound and the upper bound in the shared coins 
model for arbitrary, non-linear, mutual search algorithms. 

Secondly, the question is how changes to the model (different graphs, non- 
anonymous agents, independent agents, etc.) affect the cost of mutual search. 

Finally, time constraints may have an adverse effect on the cost of a mutual 
search algorithm. If a synchronous mutual search algorithm is required to ter- 
minate within time t, with full knowledge of the distribution of the agents, the 
algorithm may have to make more than one call per time slot, and may not be 
able to fully exploit the “silence is information” paradigm. 
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Abstract. Self-stabilizing algorithms for constructing a spanning tree 
of an arbitrary network have been studied for many models of distributed 
networks including those that communicate via registers (either compos- 
ite or read/write atomic) and those that employ message-passing. In con- 
trast, much less has been done for the corresponding minimum spanning 
tree problem. The one published self-stabilizing distributed algorithm 
for the minimum spanning problem that we are aware of 0 assumes a 
composite atomicity model. This paper presents two minimum spanning 
tree algorithms designed directly for deterministic, message-passing net- 
works. The first converts an arbitrary spanning tree to a minimum one; 
the second is a fully self-stabilizing construction. The algorithms assume 
distinct identifiers and reliable fifo message passing, but do not rely on a 
root or synchrony. Also, processors have a safe time-out mechanism (the 
minimum assumption necessary for a solution to exist.) Both algorithms 
apply to networks that can change dynamically. 



1 Introduction 

Large networks of processors are typically susceptible to transient faults and 
they are frequently changing dynamically. Ideally, basic primitives used by these 
systems can be made robust enough to withstand these faults and adapt to net- 
work changes. Dijkstra introduced a strong notion of fault tolerance called 
self-stabilization, that can meet these requirements. A distributed system is self- 
stabilizing if, when started from an arbitrary configuration, it is guaranteed to 
reach a legitimate configuration as execution progresses. If a protocol is
self-stabilizing, the system need not be initialized, which can be a significant
additional advantage, especially for physically dispersed systems such as the Internet.

Two important primitives for many protocols in distributed computing are 
construction of a spanning tree and of a minimum spanning tree. For example, 
a distributed message-passing network of processors might rely on an underly- 
ing spanning tree to manage communication. If the cost of using the different 
communication channels varies significantly, it may be desirable to identify the 
spanning tree with minimum cost. So a distributed self-stabilizing (minimum)
spanning tree protocol would eventually converge to a global state where each
processor has identified which of its adjacent edges are part of the required tree,
regardless of what each processor originally had identified as in the tree. Once
identified, this tree would remain unchanged as long as there are no further faults
and the network does not change. If further changes or faults occur, the network
should again automatically adjust to identify the (possibly new) tree.



Although there exist several self-stabilizing algorithms for the spanning tree
problem, we are aware of only one, by Antonoiu and Srimani, for a
self-stabilizing construction of a minimum spanning tree (MST). The full
version of this paper contains an overview of these spanning tree papers.
Perhaps the lack of minimum spanning tree solutions is because a generalization
of the well-known Distributed Minimum Spanning Tree algorithm (GHS) of
Gallager, Humblet and Spira to the self-stabilizing setting is not apparent.
The GHS algorithm resembles a distributed version of Kruskal's algorithm. It
maintains a spanning forest, the components of which are merged in a controlled
way via minimum outgoing edges until there is only one component, which is the
MST of the network. The algorithm relies heavily on the invariant that selected
edges are cycle-free, which we do not see how to maintain in the self-stabilizing
setting when specific edges (of small weight) must be selected.



Antonoiu and Srimani's paper proposed the first distributed self-stabilizing
algorithm for minimum spanning tree. Their algorithm is for the shared memory
model and relies on composite atomicity. One approach would be to transform
their algorithm to a read/write atomic solution for the link-register model via
an efficient transformer and then to apply a second transformation from the
atomic read/write model to the message-passing model via self-stabilizing
algorithms for token-passing and for the data-link problems as, for example,
described by Dolev. Both of these transformations can be expensive in the
worst case. So we are motivated to design directly for the message-passing
model in the hope of finding a more efficient and less complicated solution.



There are few self-stabilizing algorithms written for the message-passing
model. This paper presents two algorithms for distributed MST. The first,
Basic_MST, does not stabilize to the MST of the network from any initial
configuration; rather, it converts any valid spanning tree configuration to the
minimum spanning tree in a self-stabilizing fashion. This algorithm can be used
to maintain a minimum spanning tree in a network that has edges and nodes added
dynamically. It also serves to provide some of the ideas for the second algorithm
in a simpler but less general setting. Given Basic_MST, another approach to
finding a general self-stabilizing MST algorithm might be to use the technique of
fair composition applied to a self-stabilizing algorithm for spanning tree
construction and Basic_MST. However, we failed to see how to achieve this because
of the need to keep the variables manipulated by Basic_MST entirely disjoint from
those used to construct the spanning tree. So our second algorithm is a general
self-stabilizing deterministic minimum spanning tree algorithm that uses some
of the ideas of the first but is developed from scratch. Both algorithms differ
markedly from the GHS kind of approach.

Section 2 specifies our model. We next motivate the ideas of our algorithm by
outlining a sequential algorithm that constructs an MST given an arbitrary
spanning tree (Section 3). While this algorithm is inefficient sequentially, it
adapts well to a concurrent environment, and is the motivation for our first
algorithm. Section 4 gives a brief sketch of this algorithm, Basic_MST, and its
correctness. All the details are available in the full paper. Although one main
idea is contained in Basic_MST, substantial modifications and enhancements are
needed to convert it to a general self-stabilizing MST algorithm. Section 5
presents these details and the final algorithm, Self_Stabilizing_MST, and its
proof of correctness. Interestingly, significantly different proofs of convergence
are needed for the two algorithms even though the second algorithm builds upon
the ideas used in the first. Further comments and future work are briefly
discussed in Section 6.



2 Model 

An asynchronous distributed message-passing network of processors is modelled 
by a simple, weighted, connected and undirected graph where vertices represent 
processes, edges represent communication links between processes and weights 
represent some measure of the cost of communicating over the corresponding 
link. Each processor P has a distinct identifier, and knows only the identifiers
of its neighbours and, for each neighbour Q, the weight of the edge (P, Q).
Edge weights are assumed to be distinct since they can always be made so by
appending to each weight the identifiers of the edge's end-points.
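
The tie-breaking device mentioned above can be sketched as follows. This is a minimal illustration, assuming integer processor identifiers and numeric weights; the function name `distinct_weight` is hypothetical and does not come from the paper. The extended weight is compared lexicographically, so two edges with equal raw weight are still strictly ordered.

```python
from typing import Tuple


def distinct_weight(weight: int, p: int, q: int) -> Tuple[int, int, int]:
    """Extend a raw weight with the (sorted) end-point identifiers.

    Tuples compare lexicographically in Python, so equal raw weights are
    ordered by the end-point identifiers, which are unique per edge.
    """
    lo, hi = sorted((p, q))
    return (weight, lo, hi)
```

Because each edge has a unique pair of end-points in a simple graph, no two edges receive the same extended weight.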

The self-stabilizing MST problem requires that, given any initial configuration
of the network, each processor eventually determines, for each of its adjacent
edges, whether or not it is in the minimum spanning tree of the network.

Self-stabilization is impossible for purely asynchronous message-passing
systems. We therefore assume that each process in the network is augmented
with a time-out mechanism that satisfies a necessary safety property, namely:
each process's time-out interval is guaranteed to be at least as long as the time
taken by any message sent by the processor to travel a path of n edges, where n
is the number of processors in the network. For correctness we require that this
lower bound on the time-out interval is not violated, but it can be any (even very
large) overestimate. The time-out interval could be provided directly or could be
described as n times a, where a is the maximum time for any message to travel
any edge. In the first case, knowledge of n is not required; in the second case
this is the only place where knowledge of n is used. Of course, since our second
algorithm is self-stabilizing, a violation in the safety of a time-out can, at worst,
act as a fault from which the algorithm will eventually recover.



3 Graph Theory Preliminaries 

We begin with a new sequential algorithm that finds the minimum spanning 
tree of a graph assuming that an arbitrary spanning tree is already known. This 
algorithm is less general and less efficient than well-known greedy solutions to
the MST problem such as Kruskal's and Prim's algorithms. However, it has
some properties that are appealing in concurrent settings and network models,
and it adapts well to self-stabilization.

Let graph $G = (V, E)$ denote a connected, weighted, undirected graph with
$n$ vertices and $m$ edges. Let $T = (V, E')$ denote an arbitrary spanning tree of $G$.
A spanning tree is minimum if the sum of the weights of its edges is as small as
possible. If all edge weights are distinct, the minimum spanning tree ($MST(G)$)
is unique. An edge $e \in E'$ is called a tree edge and $e \in E \setminus E'$ is a non-tree edge.
If $(v_0, v_1, \ldots, v_k)$ is a path in $T$, and $(v_k, v_0)$ is a non-tree edge, then the cycle
$(v_0, \ldots, v_k, v_0)$ in $G$ is called a fundamental cycle of $T$ containing $(v_k, v_0)$. The
proofs of the following graph-theoretic propositions are from basic graph theory
or are derived using standard techniques. Complete proofs are given in the full
paper.

Proposition 1. If $e$ is in $MST(G)$, then $e$ is not a maximum edge in any cycle
of $G$.



Proposition 2. For any non-tree edge $e$, there is exactly one fundamental cycle
of $T$ containing $e$.

Let $\mathit{fnd\_cyl}(E', e)$ denote the unique fundamental cycle of $T = (V, E')$
containing $e$. Denote by $\max(\mathit{fnd\_cyl}(E', e))$ the edge with maximum weight in
$\mathit{fnd\_cyl}(E', e)$. For non-tree edge $e$, the function $\mathit{minimize\_cycle}(E', e)$ returns
a new set of edges and is defined by

$$\mathit{minimize\_cycle}(E', e) = (E' \cup \{e\}) \setminus \{\max(\mathit{fnd\_cyl}(E', e))\}.$$
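
As a small worked instance of this definition, consider a triangle. The graph, its vertex names and its weights are illustrative assumptions chosen here, not an example from the paper.

```latex
\begin{align*}
V &= \{a, b, c\}, \qquad w(a,b) = 1,\quad w(b,c) = 2,\quad w(a,c) = 3,\\
E' &= \{(b,c),\,(a,c)\}, \qquad e = (a,b) \in E \setminus E',\\
\mathit{fnd\_cyl}(E', e) &= (b, c, a, b),\\
\max(\mathit{fnd\_cyl}(E', e)) &= (a,c),\\
\mathit{minimize\_cycle}(E', e) &= (E' \cup \{(a,b)\}) \setminus \{(a,c)\} = \{(a,b),\,(b,c)\}.
\end{align*}
```

The result has total weight 3 instead of 5 and is still a spanning tree of the triangle.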



Proposition 3. For non-tree edge $e$, $\mathit{minimize\_cycle}(E', e)$ is an edge set of a
spanning tree of $G$.



Proposition 4. $E'$ is the edge set of $MST(G)$ if and only if $\forall e \in E \setminus E'$,

$$E' = \mathit{minimize\_cycle}(E', e).$$

The next proposition says that once minimize_cycle removes an edge from a
spanning tree by replacing it with a lighter one, no subsequent application of
minimize_cycle can put that edge back into the spanning tree.

Proposition 5. Consider a sequence of spanning trees $T_0 = (V, E_0), T_1 =
(V, E_1), T_2 = (V, E_2), \ldots$, where $T_0$ is any spanning tree of $G$, and for $i \geq 1$,

$$E_i = \mathit{minimize\_cycle}(E_{i-1}, e_{i-1}) \text{ for some edge } e_{i-1} \in E \setminus E_{i-1}.$$

Let $e_i^* = \max(\mathit{fnd\_cyl}(E_{i-1}, e_{i-1}))$. Then $\forall i \geq 1$, $e_i^* \notin E_j$ for any $j \geq i$.

The preceding propositions combine to provide the strategy for an algorithm
that converts an arbitrary spanning tree into the MST. Specifically, if
minimize_cycle is applied successively for each non-tree edge in any order for any
initial spanning tree, the result is the minimum spanning tree.

Proposition 6. Let $T_0 = (V, E_0)$ be a spanning tree of $G = (V, E)$ and
$\{e_1, e_2, \ldots, e_{m-n}\} = E \setminus E_0$ (in any order). Let $E_i = \mathit{minimize\_cycle}(E_{i-1}, e_i)$,
for $i = 1$ to $m - n$. Then $T_{m-n} = (V, E_{m-n}) = MST(G)$.

Proof. Proposition 5 implies that $\forall e \in E \setminus E_{m-n}$, $e = \max(\mathit{fnd\_cyl}(E_{m-n}, e))$.
So $E_{m-n} = \mathit{minimize\_cycle}(E_{m-n}, e)$, $\forall e \in E \setminus E_{m-n}$. So by Proposition 4,
$E_{m-n}$ is the edge set of $MST(G)$. ■
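
The sequential strategy of Proposition 6 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: edges are tuples with sorted integer end-points, `weight` maps each edge to a distinct weight, and the helper names `tree_path` and `tree_to_mst` are hypothetical. The input `tree` is assumed to be a spanning tree, so the path between any two vertices exists and is unique.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[int, int]  # end-points stored in sorted order


def tree_path(tree: Set[Edge], src: int, dst: int) -> List[Edge]:
    """Edges on the unique src-dst path of a spanning tree (iterative DFS)."""
    adj: Dict[int, List[int]] = {}
    for u, v in tree:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent: Dict[int, Optional[int]] = {src: None}
    stack = [src]
    while stack:
        x = stack.pop()
        for y in adj.get(x, []):
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path: List[Edge] = []
    node = dst
    while parent[node] is not None:  # walk back from dst to src
        p = parent[node]
        path.append((min(node, p), max(node, p)))
        node = p
    return path


def minimize_cycle(tree: Set[Edge], e: Edge, weight: Dict[Edge, int]) -> Set[Edge]:
    """(E' + e) minus the heaviest edge of the fundamental cycle of e."""
    cycle = tree_path(tree, e[0], e[1]) + [e]
    heaviest = max(cycle, key=lambda f: weight[f])
    return (tree | {e}) - {heaviest}


def tree_to_mst(tree: Set[Edge], weight: Dict[Edge, int],
                order: Iterable[Edge]) -> Set[Edge]:
    """Apply minimize_cycle once per initially non-tree edge, in the given order."""
    current = set(tree)
    for e in order:
        current = minimize_cycle(current, e, weight)
    return current
```

Each call recomputes the tree path from scratch, which is the sequential inefficiency noted in the text; the point of the sketch is only that the order of the non-tree edges does not affect the result.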

4 Construction of a Minimum Spanning Tree from a 
Spanning Tree 

From Proposition 6 we see that the application of minimize_cycle to each of
the non-tree edges results in the MST regardless of the order of application.
This suggests that the minimize_cycle operations could proceed concurrently
provided care is taken that they do not interfere with each other. This is the
central idea for a distributed algorithm, called Basic_MST, that identifies the
minimum spanning tree of a network provided the network has already identified
an arbitrary spanning tree.

Because of space constraints, we describe Basic_MST only informally here
and give only the highlights of the proof of correctness. The full description and
a detailed proof can be accessed online.

The descriptions of algorithms Basic_MST and Self_Stabilizing_MST are
simplified by temporarily changing perspective to one where we pretend that
communication edges do the processing and that nodes act as message-passing
channels between adjacent edges. Call the graph representing this particular network
setting altered(G). That is, given a message-passing network of processors
modelled by a graph G, we describe our algorithm for the network that is modelled
by altered(G), where each edge has access to the identifiers of the edges incident
at each of its end-points. It is not difficult (though notationally tedious!) to show
how the original network, G, simulates an algorithm designed for the network
altered(G). Section 6 contains an informal description of how this is done.

Each edge has a status in {chosen, unchosen} such that edges of the initial 
spanning tree are chosen and non-tree edges are unchosen. Algorithm Basic_MST 
maintains the invariant that chosen edges never form a cycle. The goal is that 
chosen edges eventually are exactly the edges of the MST. Recall that each edge
processor has a timer that, when initialized, has a value that is guaranteed to be
at least as large as the time for a message to travel a path of length n beginning at
that processor. Let safetime(e) be this value for edge e. An edge's timer is reset to
its safetime upon receipt of a message that originated from that edge, otherwise 
the edge eventually times out. Upon time-out, each unchosen edge e = (x, y) 
initiates a search for a heavier edge in the fundamental cycle containing e. Edge 
e does this by sending a search message containing e’s identity and weight to all 
edges adjacent to one of e’s end-points, say x. Any search message is discarded 
by any unchosen edge that did not initiate it. When a chosen edge, however, 
receives a search message at one end-point, it updates the message weight with
the greater of the current message weight and its own weight. It then propagates
the search to all edges adjacent to its other end-point. Because chosen edges
are cycle-free, at most one copy of the search message initiated by e can be
received by e (necessarily at the end-point y). All other copies die out when they
reach leaves of the current tree of chosen edges. When e receives its own search 
message, it contains the weight of the maximum weight edge in the fundamental 
cycle traversed by this search message. If that weight is not e’s, then a heavier 
edge has been found. In this case, e initiates another message that attempts to 
remove the identified heavy edge and insert e into the collection of chosen edges. 
Otherwise, e waits for its time-out to repeat the search. 
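
The search phase just described can be imitated sequentially. The sketch below is a centralized simulation, not the distributed protocol itself: it assumes the chosen edges are cycle-free (the Basic_MST invariant), represents edges as tuples present as keys of `weight`, and uses the hypothetical name `search`. It floods a message from end-point x over chosen edges only, carrying the running maximum weight, and reports the value carried by the copy that reaches y.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, int]


def search(chosen: Set[Edge], weight: Dict[Edge, int], e: Edge) -> Optional[int]:
    """Maximum weight seen by e's search message when it returns, else None.

    Assumes `chosen` is cycle-free, so each copy of the message either
    reaches y or dies out at a leaf of the chosen forest.
    """
    x, y = e
    adj: Dict[int, List[Tuple[int, Edge]]] = {}
    for f in chosen:  # unchosen edges discard foreign search messages
        u, v = f
        adj.setdefault(u, []).append((v, f))
        adj.setdefault(v, []).append((u, f))
    # Each entry: (node reached, edge just traversed, max weight so far).
    queue = deque([(x, None, weight[e])])
    while queue:
        node, via, best = queue.popleft()
        if node == y:
            return best  # message is back at e's other end-point
        for nxt, f in adj.get(node, []):
            if f != via:  # a chosen edge forwards only out of its other end
                queue.append((nxt, f, max(best, weight[f])))
    return None  # x and y are not connected by chosen edges
```

If the returned value equals `weight[e]`, edge e is the heaviest in its fundamental cycle and simply waits; a larger value means a heavier chosen edge exists and a remove would be initiated. A return of `None` corresponds to the message never coming back, which Basic_MST excludes by assumption.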

What can go wrong with this algorithm? Notice that if the searches and
replaces proceed serially, then Basic_MST is just an implementation of
minimize_cycle repeated for all non-tree edges and hence is correct by Proposition 6.
The full paper proves that the only problem that can arise due to concurrency
is when a heavy edge, say e, is identified by more than one search procedure. In 
this case the remove and insert procedures fail safely. Specifically, the first re- 
move message received by e will be successful, and will cause e to send an insert 
message to its initiator. The subsequent remove messages that e receives will 
be ignored because e has changed status to an unchosen edge. Thus no
corresponding insert message is generated; the edge that initiated the ignored
remove will remain unchosen, will time out waiting for the insert response and,
upon time-out, will initiate a
new search. To complete the proof, we focus on those steps of the scheduler that 
trigger the remove and insert procedures, which are called major steps. We first 
show that the weight of the chosen set decreases at every major step. We next 
show that if the chosen set is not the MST, then another major step must occur. 
Thus, the chosen set must be converted to correspond exactly to the edges of 
the minimum spanning tree. 

Proposition 5 implies that there can be at most $m - n$ successful remove-insert
executions before the MST is identified for a network with $n$ nodes and
$m$ edges. When distinct heavy chosen edges are identified for removal (by search
messages originating from distinct non-tree edges) their replacement (by remove
and insert messages) can proceed concurrently. In the worst case, however, we can
construct a graph and a scheduler that force all replacements to proceed serially,
each taking at most 3 · Safetime, where Safetime is the maximum safetime(e),
for a total stabilization time of at most $3(m - n)$ · Safetime. Notice, however,
that while the MST is being constructed, a spanning tree is maintained, which 
can be used in place of the MST while the fine tuning to the minimum weight 
tree proceeds. Furthermore, algorithm Basic_MST will continue to adjust and 
converge to the MST even if the network dynamically adds new communication 
edges, or new nodes, or revises the weights of some edges. We need only preserve 
the invariant that the chosen edges form a spanning tree. Thus the MST will be 
automatically adjusted as edge weights are revised. Provided any new edge is 
inserted with status unchosen, and any new node is inserted so that exactly one 
of its new incident edges has status chosen, again Basic_MST will automatically 
revise the chosen set to identify the new MST.

Algorithm Basic_MST is correct for any initial values of the timers and any
initial program counter values. For correctness we do, however, require that
initially the chosen edges form a spanning tree, and there are no erroneous 
messages in the network. Some additional techniques are required to achieve a 
fully self-stabilizing algorithm. 



5 A Self-Stabilizing Minimum Spanning Tree Algorithm 

5.1 Informal Description and Intuition

Algorithm Basic_MST guarantees convergence to the MST only if, in the initial
configuration, the chosen edges form a spanning tree, and there are no messages 
in the network. Algorithm Self_Stabilizing_MST alters and enhances Basic_MST 
so that the minimum spanning tree is constructed even when, initially, the chosen 
edges are disconnected or do not span the network or contain cycles and when 
there may be spurious messages already in the system. 

We again describe the algorithm for the altered graph where edges are
assumed to do the processing. Recall that each edge processor, e, has a time- 
out mechanism and an associated safetime(e) that is at least as long as the 
time for any message it sends to travel any path of length at most n. Just as 
in algorithm Basic_MST, in algorithm Self_Stabilizing_MST, when an unchosen 
edge e times out, it initiates a search message containing e’s identifier and weight, 
which propagates through chosen edges. As the propagation proceeds, the search 
message is updated so that it contains the weight of the heaviest chosen edge 
travelled by the search message. When e receives its own search message, it 
resets its timer to its safetime and, if it is heavier than the heaviest chosen 
edge travelled by the search message, e becomes passive until its next time-out. 
Otherwise, a heavier edge has been detected in a fundamental cycle containing 
e. If so, e adds itself to the chosen set, and initiates a remove message destined
for the heavy edge and intended to remove it from the chosen set. 

The intuition for the enhancements is as follows. Suppose the chosen edges 
are disconnected or do not span the network (or both). Then there is at least 
one unchosen edge e whose end-points, say x and y, are not connected by a path 
of chosen edges. So a search message initiated by e out of its x end-point cannot 
return to e’s y end-point. This is detected by e through its time-out mechanism 
and the boolean variable search_sent, which indicates that e is waiting for the
return of its search message. When this detection occurs e simply changes its 
status to chosen. 

Suppose a collection of chosen edges form a cycle. To detect cycles of chosen 
edges, each search message is augmented to record the list of edges on the path 
it travelled. If a chosen edge receives a search message at one end-point, and the 
list in that search message contains a chosen edge that is a neighbour of its other 
end-point, then the search message travelled a cycle of chosen edges. This cycle- 
detection will succeed as long as there is an unchosen edge to initiate a search 
message that will travel that cycle. Another message type is needed for the case
when all edges are initially chosen and hence no search messages are generated.
A find_cycle message is initiated by a chosen edge that timed out because it did
not receive any search message within an interval equal to its time-out interval.
In order to avoid initiating find_cycle messages prematurely (when there is a
search message on its way to this chosen edge), the time-out interval for any
chosen edge is set to three times its safetime. Like search messages, find_cycle
messages record the list of chosen edges travelled, so a chosen edge receiving a
find_cycle message can detect if it is in a cycle of chosen edges.

The chosen edge that detects a cycle (from either a search or a find_cycle
message) initiates a remove message that travels the cycle to the edge with
maximum weight in that cycle, and causes that edge to set its status to unchosen.
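
The cycle test on the recorded path can be sketched as below. This is a hedged illustration, not the paper's code: the names `find_chosen_cycle` and `heaviest` are hypothetical, a path is a list of (edge identifier, weight) pairs, and `far_neighbours` stands for the identifiers of the edges adjacent to the receiving edge's other end-point. Two choices here are assumptions of this sketch: when more than one recorded edge touches that end-point, the latest one is taken so that the returned suffix is a simple cycle, and the receiving edge itself is included in the cycle.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[int, int]
PathEntry = Tuple[Edge, int]  # (edge identifier, weight)


def find_chosen_cycle(path: List[PathEntry], far_neighbours: Set[Edge],
                      me: PathEntry) -> Optional[List[PathEntry]]:
    """Return the cycle closed by the receiving edge `me`, or None.

    Scans the recorded path from its end for an edge adjacent to the
    receiver's other end-point; the suffix from that edge plus `me` is a cycle.
    """
    for i in range(len(path) - 1, -1, -1):
        if path[i][0] in far_neighbours:
            return path[i:] + [me]
    return None


def heaviest(cycle: List[PathEntry]) -> Edge:
    """Identifier of the maximum-weight edge, the target of the remove message."""
    return max(cycle, key=lambda entry: entry[1])[0]
```

For example, a message that travelled edges (0,1), (1,2), (2,3) and then arrives at edge (1,3) through node 3 closes the cycle (1,2), (2,3), (1,3); edge (0,1) is not part of it.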

It is critical that the algorithm does not thrash between the procedure that 
changes edge status to chosen because the edge has evidence that the chosen 
set is disconnected, and the procedure that changes edge status to unchosen 
because the edge collected evidence of a cycle in the chosen set. The proof that 
this cannot happen and that the algorithm is correct, is assembled from several 
pieces as will be seen. 



5.2 Algorithm Details 

processors: An edge processor e has an edge identifier (u, v), where u and v
are the distinct identifiers of its two end-points. Let EID denote the set of edge
identifiers. Each edge processor e has a weight that is a positive integer, and is
denoted by wt(e).

The identifiers of the neighbouring edges of e at its u and v end-points are
in stable storage and available to e as N(u) and N(v) respectively.

Each edge processor maintains three variables in unstable storage: 

— A boolean chosen_status, which indicates whether or not the edge processor
e currently is in the Chosen_Set subgraph.

— A non-negative integer timer in the interval [0, 3 * safetime(e)], where
safetime(e) is an upper bound on the time required for a message sent by e
to travel any simple path in the network (necessarily of length at most n).

— A boolean search_sent, which indicates whether edge processor e has sent a
search message that has not yet returned to e.

messages: Search messages have 3 fields (“search”, eid, path), where eid is
a member of EID and path is a list of pairs where each pair is a member of
EID and a weight. The second field records the unchosen edge that initiates the
search, and the third field records the path of chosen edges travelled by the search
message and those edges’ weights. Remove messages (“remove”, path) and
find_cycle messages (“find_cycle”, path) each have two fields, with the second field
recording a path of chosen edges and weights.
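
One possible in-memory representation of the state and the three message types is sketched below. The class and field names mirror the paper's variables where they exist (`chosen_status`, `timer`, `search_sent`, `eid`, `path`), but the classes themselves and the helper `extend` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Edge = Tuple[int, int]
PathEntry = Tuple[Edge, int]  # (edge identifier, weight)


@dataclass
class EdgeProcessor:
    eid: Edge                    # stable storage
    weight: int                  # stable storage, positive integer
    safetime: int                # stable storage
    chosen_status: bool = False  # unstable storage
    timer: int = 0               # unstable storage, in [0, 3 * safetime]
    search_sent: bool = False    # unstable storage


@dataclass(frozen=True)
class Search:
    eid: Edge                         # unchosen edge that initiated the search
    path: Tuple[PathEntry, ...] = ()  # chosen edges travelled, with weights


@dataclass(frozen=True)
class Remove:
    path: Tuple[PathEntry, ...]


@dataclass(frozen=True)
class FindCycle:
    path: Tuple[PathEntry, ...]


def extend(msg: Search, edge: Edge, weight: int) -> Search:
    """Copy of a search message with one more chosen edge recorded."""
    return Search(msg.eid, msg.path + ((edge, weight),))
```

Messages are immutable here because `propagate` sends independent copies to several neighbours; the unstable fields of `EdgeProcessor` may start with arbitrary values in a self-stabilizing setting, so the defaults are only a convenience.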

protocol: Algorithm Self_Stabilizing_MST employs two procedures for edge
(u, v) to send a message. The procedure send(mess, e_neigh) sends the message
mess to the neighbouring edge processor with identifier e_neigh. (The send aborts
if e_neigh is not a neighbouring edge of (u, v).) The procedure propagate(mess, v)
sends a copy of message mess to all edge processors in N(v). The function
reset_timer causes an edge processor with chosen_status false to reset its timer to
its safetime and one with chosen_status true to reset its timer to 3 times its
safetime. Timers continue to decrement and cause a time-out when they reach
zero. For list = (x_1, x_2, ..., x_n), head(list) = x_1 and tail(list) = (x_2, ..., x_n).
The function max_weight(list) returns the maximum weight of edges in the
list. The symbol ⊕ denotes concatenation. Comments are delimited by brace
brackets ({, }).
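
The list helpers and the timer rule above translate directly into Python. In this sketch a path is a list of (edge, weight) pairs; `reset_timer_value` is a hypothetical name for the value that reset_timer assigns.

```python
from typing import List, Tuple

Edge = Tuple[int, int]
PathEntry = Tuple[Edge, int]  # (edge identifier, weight)


def head(lst: List[PathEntry]) -> PathEntry:
    """First element of a non-empty list."""
    return lst[0]


def tail(lst: List[PathEntry]) -> List[PathEntry]:
    """Everything after the first element."""
    return lst[1:]


def max_weight(path: List[PathEntry]) -> int:
    """Maximum weight recorded in a non-empty path."""
    return max(w for _edge, w in path)


def reset_timer_value(chosen_status: bool, safetime: int) -> int:
    """Timer value after reset: 3 * safetime if chosen, else safetime."""
    return 3 * safetime if chosen_status else safetime
```

Both `head` and `max_weight` raise an exception on an empty list; the sketch leaves the treatment of empty paths open.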

Algorithm Self_Stabilizing_MST
Procedure for edge processor e = (u, v):

Upon time-out:
1.  If (¬chosen_status) ∧ (¬search_sent) Then
2.      propagate((“search”, (u, v), ∅), u);
3.      search_sent ← true;
4.  Elseif (¬chosen_status) ∧ (search_sent) Then   {disconnected chosen edges}
5.      chosen_status ← true;
6.  Elseif (chosen_status) Then   {no searches happening}
7.      propagate((“find_cycle”, [(u, v)]), v);
8.  reset_timer.

Upon receipt of (“search”, sender, path) from end-point, say u:
9.  If (chosen_status) ∧ (sender ≠ (u, v)) Then
10.     reset_timer;
11.     If (∀(v, z) ∈ N(v), (v, z) ∉ path) Then   {no cycle}
12.         propagate((“search”, sender, path ⊕ (u, v)), v);
13.     Else   {∃(v, z) ∈ N(v) s.t. (v, z) ∈ path, so cycle of chosen}
14.         Let path = list_1 ⊕ list_2 where head(list_2) = (v, z)
15.         send((“remove”, list_2), (v, z));
16. Elseif (¬chosen_status) ∧ (sender = (u, v)) Then   {search traversed a fnd_cyl}
17.     reset_timer; search_sent ← false;
18.     If (max_weight(path) > wt(e)) Then
19.         chosen_status ← true;
20.         send((“remove”, path), head(path)).

Upon receipt of (“remove”, path):
20. reset_timer;
21. If path is simple Then
22.     If (wt(e) ≠ max_weight(path)) Then   {not heaviest edge}
23.         send((“remove”, tail(path)), head(tail(path)));
24.     Else
25.         chosen_status ← false;
26.         search_sent ← false.

Upon receipt of (“find_cycle”, path) from end-point, say u:
27. reset_timer;
28. If (chosen_status) Then
29.     If (∀(v, z) ∈ N(v), (v, z) ∉ path) Then
30.         propagate((“find_cycle”, path ⊕ (u, v)), v);
31.     Else   {∃(v, z) ∈ N(v), (v, z) ∈ path}
32.         Let path = list_1 ⊕ list_2 where head(list_2) = (v, z)
33.         send((“remove”, list_2), (v, z)).
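
The time-out handler (lines 1 to 8) is the part of the protocol that can be exercised in isolation, and the sketch below does that. Several things are assumptions of this sketch rather than statements of the paper: the class `EdgeState` and function `on_timeout` are hypothetical names, outgoing traffic is returned as a list of (message, end-point) pairs instead of being sent, line 8 is taken to apply to all three branches, and the find_cycle message is sent out of the second end-point.

```python
from dataclasses import dataclass
from typing import List, Tuple

Edge = Tuple[int, int]


@dataclass
class EdgeState:
    eid: Edge
    safetime: int
    chosen_status: bool = False
    search_sent: bool = False
    timer: int = 0


def on_timeout(s: EdgeState) -> List[tuple]:
    """Apply the time-out rule to `s`; return (message, end-point) pairs to propagate."""
    u, v = s.eid
    out: List[tuple] = []
    if not s.chosen_status and not s.search_sent:
        out.append((("search", s.eid, []), u))   # lines 1-3: start a search
        s.search_sent = True
    elif not s.chosen_status and s.search_sent:
        s.chosen_status = True                   # lines 4-5: search never returned
    else:
        out.append((("find_cycle", [s.eid]), v)) # lines 6-7: no searches seen
    # line 8: reset_timer, using the status as it stands after the branch
    s.timer = 3 * s.safetime if s.chosen_status else s.safetime
    return out
```

Calling `on_timeout` three times on a fresh unchosen edge walks through all three branches in order: a search is started, then the edge declares itself chosen, then it starts probing for cycles.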

5.3 Correctness of Self_Stabilizing_MST

We prove that if algorithm Self_Stabilizing_MST is executed from any initial
configuration, then eventually, those edges with chosen_status = true will be
exactly the edges of the minimum spanning tree and will subsequently not change.
Consider any execution of algorithm Self_Stabilizing_MST proceeding in steps
that are determined by a weakly fair scheduler. At step i, the processor chosen
by the scheduler executes its next atomic action.

Let local_state(e, i) be the sequence of chosen_status(e), timer(e) and the
collection of messages at e at step i. At any step i, the important attributes of
the state of the entire system are captured by the configuration at step i, denoted
Config(i), and defined by:

$$Config(i) = (local\_state(e_1, i),\ local\_state(e_2, i),\ \ldots,\ local\_state(e_m, i))$$

Define $Chosen\_Set(i) = \{e \mid \text{in } Config(i),\ chosen\_status(e) = true\}$. A c_cycle
is a cycle of edges each of which has chosen_status(e) true.

We consider the behaviour of the network after the initial, spurious messages 
have been “worked out” of the system. Call a search message (“search”, sender, 
path), genuine if 1) it was initiated by an unchosen edge with edge identifier 
equal to sender, and 2) path is a non-cyclic path of edges starting at sender. 
Define genuine find_cycle messages similarly. A remove message is genuine if it 
was generated in response to a genuine search or find_cycle message. Let M be 
the maximum safetime for any edge in the system. Define step $S_i$ to be the first
step that occurs after time $M \cdot i$.

Lemma 1. By step $S_1$ all messages are genuine.

Proof. By the definition of safetime, any Search, Find_cycle or Remove message 
that survives for M time must have travelled a path of length more than n. 
However, Search and Find_cycle messages stop when a cycle or a fundamental
cycle is detected, which must happen before n edges are traversed. A Remove 
message is discarded if its path contains repeated nodes. Otherwise it can travel 
at most along the edges in path, of which there are at most n. ■ 

Lemma 2. Let $V' \subseteq V$ such that the subgraph of $(V, Chosen\_Set(i))$ induced by
$V'$ is connected and $i > S_1$. Then $\forall$ step $j > i$, the subgraph of $(V, Chosen\_Set(j))$
induced by $V'$ is connected.




204 L. Higham and Z. Liang 



Proof. Algorithm Self_Stabilizing_MST removes an edge from Chosen_Set(i)
only upon receipt of a remove message (line 25). Since $i > S_1$, such a message
is genuine, and hence was created when a c_cycle was discovered. This message
can remove only the unique edge with maximum weight in that c_cycle. So at
most one edge can be removed from the c_cycle. Thus $V'$ must remain connected
by chosen edges. ■

Lemma 3. For any initial Config(0), $\forall$ steps $i > S_3$, the subgraph
$(V, Chosen\_Set(i))$ is connected and spans the network.

Proof. Assume Chosen_Set(i) is disconnected or does not span G for all i,
$S_1 < i < S_3$. Let $(V_1, V_2)$ be any two subsets of V such that the edges on paths between
$V_1$ and $V_2$ are unchosen for the interval from $S_1$ to $S_3$. By step $S_2$, each of these
unchosen edges will have timed out, sent a search message and set search_sent to
true. By step at most $S_3$ none will have had its search message that it initiated
returned to its opposite end-point, so each will have timed out again. Because
search_sent is true each will change its chosen_status to true. By Lemma 2, once
$V_1$ and $V_2$ are connected they remain connected. ■

It is easy to check that if the network is itself a tree, then by step $S_2$ all edges
are chosen and will remain so. Thus Self_Stabilizing_MST is correct for any tree
network. The remainder of this proof assumes that the network G is not a tree.

We introduce the latent status to capture any edge that has a Remove message
destined for it. More precisely, define the latent_status of edge e by: At step 0,
$\forall e$, latent_status(e) is false. latent_status(e) becomes true at step i if, at step
i, an unchosen edge e' changes its chosen_status to true because of receipt of
a search message (“search”, sender, path) where sender = e' and e is the edge
with maximum weight in path. latent_status(e) becomes false at step j if, at
step j, edge e changes its chosen_status to false because of receipt of a remove
message (“remove”, path) where e is the maximum weight edge of path. Let
$Latent\_Set(i) = \{e \mid \text{in } Config(i),\ latent\_status(e) = true\}$.

Lemma 4. If G is not a tree, then for all steps $i > S_3$,

$Chosen\_Set(i) \setminus Latent\_Set(i)$ is a proper subset of $E$.

Proof. Since Chosen_Set(i) is connected for any $i > S_3$ by Lemma 3, the only
way for an unchosen edge to become chosen is line 19, which also adds an edge to
Latent_Set. The only way for Latent_Set to lose an edge is if that edge changes
to unchosen.

Suppose $Chosen\_Set(3M) = E$ and $Latent\_Set(3M) = \emptyset$. Since G is not
a tree, there exists at least one c_cycle. If there is any search message in the
network, it will propagate to this c_cycle and generate a remove message destined
for the edge with maximum weight, say e, in that c_cycle by some step $i < S_4$. So
$e \in Latent\_Set(i)$. Otherwise, some edge of the c_cycle will time out within time
3M, generate a find_cycle message, and detect the cycle in at most M additional
time. So by some step $j < S_3$, $e \in Latent\_Set(j)$.

Therefore, once $Chosen\_Set(i) \setminus Latent\_Set(i)$ is a proper subset of $E$ for
$i > S_3$, it remains a proper subset for all $j > i$. ■

Lemma 5. If G is not a tree, no chosen edge can time out after Ss.

Proof. By Lemma 4 there is always an unchosen edge or a latent edge in
the network. Once a chosen edge has reset its timer, a latent edge will become 
unchosen and an unchosen edge will time-out and propagate a search message, 
before a chosen edge can again time-out, because the chosen edge sets its timer 
to 3 times that required for any message to reach it. Because the chosen set is 
connected, the search message will reach any chosen edge and will cause it to 
reset its timer before timing out. ■ 

Let $e_1, e_2, \ldots, e_m$ be the edges of $G$ sorted in order of increasing weight. Let
$\hat{E}$ be the subset of $E$ consisting of edges of $MST(G)$. Define $k(s)$ to be the smallest
integer in $\{0, \ldots, m\}$ such that, $\forall i > k(s)$, $e_i \in Chosen\_Set(s)$ if and only if
$e_i \in \hat{E}$. Observe that $k(s)$ is the index of the maximum weight edge such that the
predicate $(e_{k(s)} \in Chosen\_Set(s))$ differs from $(e_{k(s)} \in \hat{E})$. The proof proceeds
by showing that after step S'g, we have the safety property that the index $k(s)$
never increases, and the progress property that integer $k(s)$ eventually decreases.
Then we will be able to conclude that eventually $k(s)$ must be 0, implying that
the chosen set is the edge set of MST. All the remaining lemmas are implicitly
intended to apply after step iSg.
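
The index k(s) is easy to compute for a given configuration, which can help when checking the lemmas on small examples. The sketch below assumes the edges are supplied already sorted by increasing weight and that the chosen set and the MST edge set are Python sets; the function name `k_of` is hypothetical. It returns 0 exactly when the chosen set equals the MST edge set.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def k_of(sorted_edges: List[Edge], chosen: Set[Edge], mst: Set[Edge]) -> int:
    """1-based index of the heaviest edge on which `chosen` and `mst` disagree.

    Scans from the heaviest edge downwards and stops at the first edge whose
    membership in `chosen` differs from its membership in `mst`; returns 0
    if there is no disagreement.
    """
    k = len(sorted_edges)
    while k > 0:
        e = sorted_edges[k - 1]
        if (e in chosen) != (e in mst):
            return k
        k -= 1
    return 0
```

Under this reading, the safety and progress properties say that successive configurations yield a non-increasing sequence of `k_of` values that eventually reaches 0.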

Lemma 6. For all $e$ in $\hat{E}$, if $e \in Chosen\_Set(s)$ then $e \in Chosen\_Set(s')$
$\forall s' > s$.

Proof. The only way that edge status changes from chosen to unchosen is by
receipt of a remove message, which can only remove the edge with maximum
weight in some cycle of the network. However, by Proposition 1, no such edge
can be in $\hat{E}$. ■



Lemma 7. $\hat{E} \subseteq Chosen\_Set(s)$.

Proof. By Lemma 6, every $e \in \hat{E} \cap Chosen\_Set(s)$ stays in
$Chosen\_Set(s')$ for $s' > s$. If $\exists e \in \hat{E}$ and $e \notin Chosen\_Set(s)$, $e = (u, v)$ will
time out and initiate its search message at one end-point, say $u$. By Lemma 3,
$(V, Chosen\_Set(s))$ is connected and spans the network. So some copy of $e$'s
search message will return to its $v$ end-point. By Proposition 1, there must exist
$e'$ in the path travelled by the search message, with larger weight than $e$. So $e$
becomes a chosen edge. ■

Lemma 8. $k(s)$ is non-increasing.

Proof. By the definition of $k(s)$, if $i > k(s)$ and $e_i \in Chosen\_Set(s)$, then
$e_i \in \hat{E}$. By Lemma 6, for every $s' > s$, $e_i \in Chosen\_Set(s')$.

By the definition of $k(s)$, if $i > k(s)$ and $e_i \notin Chosen\_Set(s)$ then $e_i \notin
\hat{E}$. Suppose $\exists s' > s$, $e_i \in Chosen\_Set(s')$. Only line 5 or line 19 can change
chosen_status of $e_i$ from unchosen to chosen. Line 5 is impossible after time
$S_3$ by Lemma 3. So $e_i$ received its own search message (“search”, $e_i$, path)
indicating the maximum weight in path is a chosen edge $e_j$ other than $e_i$. Since
weight($e_j$) > weight($e_i$), $j > i$ because indexes of edges are by increasing weight.
Hence $j > k(s)$ and thus $e_j$ is chosen implies $e_j \in \hat{E}$. But $e_j$ is also a maximum
edge in a cycle, contradicting Proposition 1. ■

Lemma 9. If k(s) ≥ 1, then there exists s' > s such that k(s') < k(s).

Proof. By the definition of k(s), e_{k(s)} ∈ Chosen_Set(s) if and only if
e_{k(s)} ∉ E. By Lemma 7, E ⊆ Chosen_Set(s), so chosen_status(e_{k(s)}) = true
in Config(s) and e_{k(s)} ∉ E. Consider the unique
fnd_cyl(E, e_{k(s)}) = (e_{k(s)}, e_{a_1}, ..., e_{a_l}) of edges in E and the
edge e_{k(s)}. Again by Lemma 7, e_{a_i} ∈ Chosen_Set(s) for each 1 ≤ i ≤ l.
So e_{k(s)}, e_{a_1}, ..., e_{a_l} is a cycle of chosen edges in Config(s). In
this cycle e_{k(s)} must be heaviest, because by the earlier proposition no
edge in E can be the heaviest of any cycle of G. This cycle will be detected by
some search message that traverses it, and e_{k(s)} will be removed from the
chosen set at some subsequent step s'. Thus in Config(s'),
chosen_status(e_{k(s)}) = false. So by Lemma 8, k(s') < k(s). ■

Lemma 10. If (V, Chosen_Set(s)) is a minimum spanning tree, then for all
s' > s, (V, Chosen_Set(s')) is a minimum spanning tree.

Proof. When Chosen_Set(s) is a minimum spanning tree, there is no cycle of
chosen edges and there is a path of chosen edges between every pair of vertices.
Every unchosen edge is the largest edge in the cycle consisting of itself and
the path of chosen edges between its end-points. So in Self_Stabilizing_MST,
the unchosen edges keep timing out and sending search messages. Each search
message returns to its initiator with the information that the unchosen initiator
is the edge with maximum weight in the path traversed. So the unchosen edge
is passive until it next times out. ■

The preceding lemmas combine to show that our algorithm is correct.

Theorem 1. Algorithm Self_Stabilizing_MST is a self-stabilizing solution for the
minimum spanning tree problem on message-passing networks.

6 Further Comments and Future Work 

The presentations of algorithms Basic_MST and Self_Stabilizing_MST were sim- 
plified by describing them from the “edge-processor” perspective where edges 
rather than nodes were assumed to drive the computation. To convert to the 
network model, we select one of the end-points of each edge to simulate the 
edge-processor. This selection could be done by simply choosing the end-point 
with the largest identifier, or the work could be spread more evenly by more care- 
fully tuning this assignment. Notice that under the edge-processor descriptions, 
each edge is assumed to have access to the edge-identifiers of the edges incident 
on either of its end-points. Thus, when end-point u simulates the computation
of edge-processor (u,v), u will need information about the edges adjacent to v. 
Typically, this information is not directly available to u. So each node uses a 
self-stabilizing local update algorithm to gather information on the topology up 
to a radius of two. The final algorithm is constructed from the fair composition 
of the local update algorithm and the end-point simulation of the edge-processor 
algorithm. 

We have advertised the Internet as a possible application for our algorithms. 
This needs to be defended because, as was pointed out by an anonymous referee, 
the Internet is so large that at any time there is likely to be some fault in it. 
Thus, it is highly unlikely that following a fault there would be a fault-free
interval at least as big as the stabilization time, especially given the large
worst-case time to stabilization of our algorithms. However, the repairs in
Basic_MST and Self_Stabilizing_MST proceed in a distributed fashion and are
typically quite independent. A fault in one part of a large network will only
affect those parts of the spanning tree that might “see” that fault. For
example, suppose a spanning
tree edge fails by erroneously becoming unchosen due to a fault, or disappearing 
due to a dynamic network change. Then, only those spanning tree paths that 
use this edge are affected. In the Internet most connections follow approximately 
physically direct paths (routing from New York to Boston does not go via Lon- 
don), and due to caching and replication, a great deal of the traffic is relatively 
local. Thus most of the network spanning tree will continue to function without 
noticing the fault. Similar arguments can be made to defend the behaviour of 
the algorithms as adequate in the case of other faults or dynamic changes. Be- 
cause there is no dependence on a root that must coordinate the revisions, the 
repairs to the spanning tree can proceed independently and “typically” locally. 
Note however, that there is no general local detection and correction claim that 
can be proved. 

Setting the safetime for each edge could present a challenging trade-off in 
some cases. Before stabilization, time-outs trigger some essential repair mecha- 
nisms in situations that would otherwise be deadlocked. After stabilization, no 
errors are detected so a processor does nothing until a time-out causes it to 
restart error detection. Thus inflated safetimes slow convergence in some
situations, but reduce message traffic after stabilization. As can be seen from
the algorithm, the unstable configurations that trigger time-outs are quite
specialized and may be highly unlikely to occur during stabilization in some
applications. In this case it may be advantageous to choose rather large values.
Further work, including some simulation studies, is necessary to determine
appropriate safetimes for particular applications.

Our algorithms permit different safetime settings for each edge, which may
be convenient since agreement does not need to be enforced. However, we do
not see how to exploit this possibility for efficiency while strictly maintaining
self-stabilization. If faults caused by premature time-outs are tolerable, it may
be reasonable to set safetime for each edge so that it is only likely to be safe
rather than guaranteed. In this case it may be useful to exploit the possibility
of different time-out intervals for different edges, by setting them to reflect the
time required for a message to travel the edge’s fundamental cycle instead of
any simple path in the network.

Both algorithms create a lot of messages some of which can be very long. 
The algorithms will remain impractical until these problems are addressed. 

We are grateful for the thoughtful comments and careful readings by anony- 
mous referees as provided by both the DISC and the WSS program committees. 
Abstract. Distributed queuing is a fundamental problem in distributed
computing, arising in a variety of applications. In a distributed queuing 
protocol, each participating process informs its predecessor of its identity, 
and (when appropriate) learns the identity of its successor. 

This paper presents a new, self-stabilizing distributed queuing proto- 
col. This protocol adds self-stabilizing actions to the Arrow distributed 
queuing protocol, a simple path-reversal protocol that runs on a network 
spanning tree. 

The protocol is structured as a layer that runs on top of any self- 
stabilizing spanning tree protocol. This additional layer stabilizes in con- 
stant time, establishing that self-stabilizing distributed queuing is no 
more difficult than self-stabilizing spanning tree maintenance. The key 
idea is that the global predicate defining the legality of a protocol state 
can be written as the conjunction of many purely local predicates, one 
for each edge of the spanning tree. 



1 Introduction 

In the distributed queuing problem, processes in a message-passing network asyn- 
chronously and concurrently place themselves in a distributed logical queue. 
Specifically, each participating process informs its predecessor of its identity, 
and (when appropriate) learns the identity of its successor. 

Distributed queuing is a fundamental problem in distributed computing, arising
in a variety of applications. For example, it can be used for scalable ordered
multicast, to synchronize access to mobile objects, for distributed mutual
exclusion (by passing a token along the queue), for distributed counting (by
passing a counter), or for distributed implementations of synchronization
primitives such as swap.

The Arrow protocol is a simple distributed queuing protocol based on
path reversal on a network spanning tree. This protocol has been used to manage
mobile objects (by queuing access requests) in the Aleph Toolkit, where it
has been shown to significantly outperform conventional directory-based schemes
under high contention. A recent theoretical analysis has shown it to be
competitive with the “optimal” distributed queuing protocol under situations of
high contention.

The Arrow protocol is not fault-tolerant, because it assumes that nodes and
links never fail. In this paper, we explore one approach to making the Arrow
protocol fault-tolerant: self-stabilization. Informally, a system is
self-stabilizing if, starting from an arbitrary initial global state, it
eventually reaches a “legal” global state, and henceforth remains in a legal
state.

Self-stabilization is appealing for its simplicity. Rather than enumerate all 
possible failures and their effects, we address failures through a uniform mecha- 
nism. Our self-stabilizing protocol is scalable: each node interacts only with its 
immediate neighbors, without the need for global coordination. 

Of course, self-stabilization is appropriate for some applications, but not 
others. For example, one natural application of distributed queuing is ordered 
multicast, in which all participating nodes receive the same set of messages 
in the same order. A self-stabilizing queuing protocol might omit messages or 
deliver them out of order in the initial, unstable phase of the protocol, but would 
eventually stabilize and deliver all messages in order. Our protocol is appropriate 
only for applications that can tolerate such transient inconsistencies. 

The key idea is that the global predicate defining the legality of a protocol 
state can be written as the conjunction of many purely local predicates, one for 
each edge of the spanning tree. We show that the delay needed to self-stabilize the 
Arrow protocol differs from the delay needed to self-stabilize a rooted spanning 
tree by only a constant. Since distributed queuing is a global relation, it may 
seem surprising that it can be stabilized in constant additional time by purely 
local actions. 

We note that the protocol is locally checkable, and we could use a general
technique for locally checkable protocols to correct the state locally. But this
would lead to a stabilization time of the order of the diameter of the tree,
whereas our scheme gives a constant stabilization time.

2 The Arrow Protocol 

The Arrow protocol was introduced by Kerry Raymond and later used
by Demmer and Herlihy to manage distributed directories. We now give a
brief and informal description of the Arrow protocol. More detailed descriptions
appear elsewhere. The protocol runs on a fixed spanning tree T of the
network graph. Each node stores an “arrow” which can point either to itself, or 
to any of its neighbors in T. If a node’s arrow points to itself, then that node is 
tentatively the last node in the queue. Otherwise, if the node’s arrow points to 
a neighbor, then the end of the queue currently resides in the component of the 
spanning tree containing that neighbor. Informally, except for the node at the 
end of the queue, a node knows only in which “direction” the end of the queue 
lies. 

The protocol is based on path reversal. Initially, one node is selected to be 
the head of the queue, and the tree is initialized so that following the arrows 
from any node leads to that head. To place itself on the queue, a node v sends a
find(v) message to the node indicated by its arrow, and “flips” its arrow to point
to itself. When a node x whose arrow points to u receives a find(v) message
from tree neighbor w, it immediately “flips” its arrow back to w. If u ≠ x, then x
forwards the message to u, the prior target of its arrow. If u = x (x is tentatively
the last node in queue), then it has just learned that v is its successor. (In many
applications of distributed queuing, x would then send a message to v, but we
do not consider that message as a part of the queuing protocol itself.)
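The path-reversal rule described above can be sketched as follows. This is only an illustration under simplifying assumptions: requests are handled one at a time (no concurrent finds), the tree is a hypothetical path 1-2-3-4 with node 1 as the initial head, and the function name enqueue is not from the paper.

```python
def enqueue(v, pointer):
    """Deliver find(v) hop by hop, assuming no other request is in flight.
    Returns the node that learns v is its successor."""
    node, sender = pointer[v], v
    pointer[v] = v                      # v flips its arrow to point to itself
    if node == v:
        return v                        # v was already the tail of the queue
    while True:
        nxt = pointer[node]             # prior target of this node's arrow
        pointer[node] = sender          # flip the arrow back toward the sender
        if nxt == node:
            return node                 # node was the tail; v is its successor
        node, sender = nxt, node        # forward find(v) along the old arrow

# Hypothetical path tree 1-2-3-4; all arrows initially lead to node 1.
pointer = {1: 1, 2: 1, 3: 2, 4: 3}
print(enqueue(4, pointer))   # node 1 learns that 4 is its successor
print(enqueue(2, pointer))   # node 4 learns that 2 is its successor
print(pointer)               # all arrows now lead to node 2, the new tail
```

After the two requests the queue order is 1, 4, 2, and following the arrows from any node ends at node 2.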

3 Model 

We assume all communication links are FIFO, and that message and processor 
delays are bounded and known in advance. In particular, a node can time out if it 
is waiting for a response. If the time out occurs, no response will be forthcoming. 
(Gouda and Multari have shown that such a timeout assumption is necessary
for self-stabilization.)

Self-stabilizing protocols can be built in a layered fashion. The protocol
presented here is layered on top of a self-stabilizing rooted spanning tree
protocol. In this paper, we focus only on the upper layer, assuming that
our protocol runs on a fixed rooted spanning tree. We show how to stabilize the
arrows and the find messages.

The rest of the paper is organized as follows. Section 4 lays down the formal
definitions of what it means to be a legal state and what the possible
initial states are. Section 5 gives the key ideas and an informal description of
the protocol. The full protocol is presented in Sect. 6, and Sect. 7 contains a
proof of its correctness and a discussion of stabilization time. The final
section contains the conclusions.



4 Local and Global States 

Initially, each node is in a legal local state (for example, integer variables have 
integer values), but local states at different nodes can be inconsistent with each 
other. Network edges can hold a finite number of messages. The algorithm exe- 
cuting at a node is fixed and incorruptible. 

Recall that an underlying self-stabilizing protocol yields a rooted spanning 
tree T which we treat as fixed. Every node knows its neighbors in the spanning 
tree. As described above, in the standard Arrow protocol, each node v has a
pointer denoted by p(v). Nodes communicate by find messages.

A global state of the protocol consists of the value of p(v) for every vertex
v of T (that is, the orientation of the arrows) and the set of find messages in
transit on the edges of T.

It is natural to define a legal protocol state as one that arises in a normal
execution of the protocol. In the initial quiescent state, following the pointers
from any node leads to a unique “sink” (a node whose arrow points to itself).
A node initiates a queuing request by sending a find message to itself. When a
node v gets a find message, it forwards it in the direction of p(v) and flips
p(v) to point to the node where the find came from. If p(v) is v, then the find
has been queued behind v’s last request. Any of these actions is called a find
transition. A legal execution of the protocol moves from one global state to the
next via a find transition.
Definition 1. A state is quiescent if following the arrows from any node leads
to a unique sink and there are no find messages in transit.



Fig. 1. On the left is the spanning tree T. On the right a legal quiescent state of the 
protocol 



Definition 2. A state is legal if either it is a quiescent state or it can be
reached from a quiescent state by a finite sequence of find transitions.



Fig. 2. On the left is a legal state which is not quiescent. On the right is an illegal 
state 

In a possible (illegal) initial state, p(v) may point to any neighbor of v in T,
and each edge may contain an arbitrary (but finite) number of find messages in
transit in either or both directions. See Fig. 1 and Fig. 2.

5 Local Stabilization Implies Global 

Though the predicate defining whether a protocol state is legal or not is a global 
one, which depends on the values of all the pointers and the finds in transit, we 
show that it can be written as the conjunction of many local predicates, one for 
each edge of the spanning tree. 
Suppose the protocol was in a quiescent state (no finds in transit). Let e be
an edge of the spanning tree connecting nodes a and b. e divides the spanning 
tree into two components, one containing a and the other containing b. There 
is a unique sink which either lies in the component containing a or in the other 
component. Since all arrows should point in the direction of the sink, either b 
points to a or vice versa, but not both. 

Now if the global state were not quiescent and there was a find message in
transit from a to b, it must be true that a was pointing to b before it sent the
find, but no longer is (the actions of the protocol cause the arrow to turn away
from the direction it just forwarded the find to, toward the direction the find
came from). Thus a and b both point away from each other while the find is in
transit.

The above cases motivate the following definition. Denote the number of find
messages in transit on e by F(e). Let p(a, e) be 1 if a points on e (i.e., to b)
and 0 otherwise; p(b, e) is defined similarly. For an edge e, we define φ(e) by

φ(e) = p(a, e) + p(b, e) + F(e)

We say that edge e is legal if φ(e) = 1 (either p(a) = b or p(b) = a or a find is
in transit, but no two cases can occur simultaneously).
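The definition of φ(e) translates directly into a local check per edge. The following minimal sketch assumes a hypothetical four-node tree and a dictionary-based state; the names pointer and in_transit are illustrative only.

```python
def phi(edge, pointer, in_transit):
    """phi(e) = p(a, e) + p(b, e) + F(e) for the tree edge e = (a, b)."""
    a, b = edge
    p_a = 1 if pointer[a] == b else 0        # a points across e
    p_b = 1 if pointer[b] == a else 0        # b points across e
    return p_a + p_b + in_transit.get(edge, 0)

# Hypothetical tree with edges 1-2, 2-3, 2-4 and node 3 as the unique sink.
edges = [(1, 2), (2, 3), (2, 4)]
pointer = {1: 2, 2: 3, 3: 3, 4: 2}
in_transit = {}                              # quiescent: no finds on any edge

legal = {e: phi(e, pointer, in_transit) == 1 for e in edges}
print(legal)
```

Changing pointer[4] to 4 makes edge (2, 4) illegal, because then neither endpoint points across it and no find is in transit on it.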

We now state and prove the main theorem of this section. 

Theorem 1. A protocol state is legal if and only if every edge of the spanning 
tree is legal. 

Proof. Follows from Theorems 2 and 3. □



Theorem 2. If a protocol state is legal, then every edge of the spanning tree is 
legal. 

Proof. In a quiescent state, there are no finds in transit and we claim that for 
any two adjacent nodes on the tree a and b, either a points to b or vice versa, 
but not both. 

Clearly, a and b cannot both point to each other since we will not have a
unique sink in that case. Now suppose that a and b pointed away from each
other. Then we can construct a cycle in the spanning tree as follows. Suppose
s was the unique sink. Following the arrows from a and b leads us to s. These
arrows induce paths p_a and p_b in the tree, which intersect at s (or earlier).
The cycle consists of: edge e, p_a and p_b. A tree has no cycles, so this case is
impossible. Thus φ(e) is 1 for every edge e.

Further, any find transition preserves φ(e) for every edge e. To prove this,
we observe that a find transition could be one of the following (u and v are
nodes of the tree).

(1) v receives a find from itself; it forwards the find to p(v) and sets p(v) = v.

(2) v receives a find from u and p(v) ≠ v; it forwards the find to p(v) and sets
p(v) = u.

(3) v receives a find from u and p(v) = v; it queues the request at v and sets
p(v) = u.

In each of the above cases, it is easy to verify that φ(e) is preserved for every
edge e. We do not do so here due to space constraints. Since every legal state
is reached from a quiescent state by a finite sequence of find transitions, this
concludes the proof. □
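The invariance of φ(e) under the three find transitions can also be checked mechanically. The sketch below is an illustration rather than a proof: it assumes a hypothetical four-node tree, models each direction of an edge as a FIFO list, and applies randomly chosen transitions starting from a quiescent state. The helper names deliver and transit are not from the paper.

```python
import random

def phi(edge, pointer, transit):
    a, b = edge
    finds = len(transit[(a, b)]) + len(transit[(b, a)])
    return int(pointer[a] == b) + int(pointer[b] == a) + finds

def deliver(v, sender, pointer, transit):
    """Node v handles a find that came from `sender` (v itself for a new request)."""
    target = pointer[v]
    if target != v:
        transit[(v, target)].append("find")   # cases (1) and (2): forward to p(v)
    pointer[v] = sender                        # flip the arrow; in case (3) the find is queued at v

edges = [(1, 2), (2, 3), (2, 4)]               # hypothetical tree
pointer = {1: 2, 2: 3, 3: 3, 4: 2}             # quiescent state with sink 3
transit = {d: [] for a, b in edges for d in ((a, b), (b, a))}
rng = random.Random(0)

for _ in range(1000):
    busy = [d for d, queue in transit.items() if queue]
    if busy and rng.random() < 0.7:
        u, v = rng.choice(busy)                # deliver the oldest find on channel u -> v
        transit[(u, v)].pop(0)
        deliver(v, u, pointer, transit)
    else:
        v = rng.choice(list(pointer))          # v initiates a new queuing request
        deliver(v, v, pointer, transit)
    assert all(phi(e, pointer, transit) == 1 for e in edges)
```

Every assertion holds because each transition removes one unit from φ of an edge (an arrow or a find) and adds exactly one unit back on the same edge.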



Theorem 3. If every edge of the spanning tree is legal then the protocol state is 
legal. 

Proof. Let L be a protocol state where every edge is legal. Consider the directed
graph A_L induced by the arrows p(v) in L. Since each vertex in A_L has
out-degree 1, starting from any vertex, we can trace a unique path. This path
could be non-terminating (if we have a cycle of length greater than 1) or could
end at a self-loop.

Lemma 1. The only directed cycles in A_L are of length one (i.e., self-loops).

Proof. Any cycle of length greater than two would induce a cycle in the
underlying spanning tree, which is impossible. A cycle of length two implies an
edge e = (a, b) with p(a) = b and p(b) = a. This would cause φ(e) to be greater
than one and is also ruled out. □

The next lemma follows directly. 

Lemma 2. Every directed path in A_L must end in a self-loop.

We are now ready to prove the theorem. We show that there exists some 
quiescent state Q and a finite sequence of find transitions seq which takes Q to 
L. Our proof is by induction on k, the number of find messages in transit in L. 

Base case: k = 0, i.e no find messages in transit. We prove that L has a 
unique sink and is a quiescent state itself and thus seq is the null sequence. 

We employ proof by contradiction. Suppose L has more than one sink and s_1
and s_2 are two sinks such that there are no other sinks on the path connecting
them on the tree T. There must be an edge e = (a, b) on this path such that
neither p(a) = b nor p(b) = a. To see this, let n be the number of nodes on the
path connecting s_1 and s_2 (excluding s_1 and s_2). The arrows on these nodes
point across at most n edges. Since there are n + 1 edges on this path there must
be at least one edge e which does not have an arrow pointing across it. For that
edge, φ(e) = 0, making it illegal and we have a contradiction.

Inductive case: Assume that the result is true for k < l. Suppose L has l
find messages in transit. Suppose a message is in transit on edge e from node
a to node b (see Fig. 3). Since φ(e) = 1, a must point away from b and b away
from a. We know from Lemma 2 that the unique path starting from a in A_L must
end in a self-loop. Let P = a, u_1, ..., u_x, u be that path, with u having a
self-loop.

Clearly, we cannot have any find messages on an edge in P, because that
would cause φ of that edge to be greater than one (an arrow pointing across the
edge and a find message in transit). Consider a protocol state L' (see Fig. 4)
where u does not have a self-loop. Instead, p(u) = u_x and all the arrows on path
P are reversed, i.e., p(u_x) = u_{x-1} and so on until p(u_1) = a. Edge e is free
of find messages and p(a) = b. The state of the rest of the edges in L' is the
same as in L.

Fig. 3. Global State L has a find on e and a self-loop on u

Fig. 4. Global State L' has one find message less than L. L can be reached from L'
by a sequence of find transitions

We show that the φ of every edge in L' is one. The edges on P and the edge
e all have φ equal to 1, since they have exactly one arrow pointing across them
and no finds in transit. The other edges are in the same state as they were in L
and thus have φ equal to 1.

Moreover, L can be reached from L' by the following sequence of find
transitions seq_{L',L}: u initiates a queuing request and the find message
travels the path u → u_x → u_{x-1} ... u_1 → a, reversing the arrows on the
path, and is currently on edge e.

Since L' has l − 1 find messages in transit and every edge of T is legal in
L', we know from induction that L' is reachable from a quiescent state Q by
a sequence of find transitions seq_{Q,L'}. Clearly, the concatenation of
seq_{Q,L'} with seq_{L',L} is a sequence of find transitions that takes
quiescent state Q to L. □



Self Stabilization on an Edge 

Armed with the above theorem, our protocol simply stabilizes each edge sepa- 
rately. Stabilizing each edge to a legal state is enough to make the global state 
legal. Nodes adjacent to an edge e repeatedly check φ(e) and “correct” it, if
necessary.

The following decisions make the design and proof of the protocol simpler:

— The corrective actions to change φ(e) are designed not to change φ(f) for
any other edge f. This is a crucial point so that now the effect of corrective
actions is local to the edge only and we can prove stabilization for each edge
separately.

— Out of the two adjacent nodes to an edge e, the responsibility of correcting
φ(e) rests solely with the parent node (parent in the underlying rooted
spanning tree T). The child node never changes φ(e).
If the value of φ(e) could be determined locally at the parent, then we would
be done. The problem though, is that φ(e) depends on the values of variables
at the two endpoints of e and on the number of find messages in transit. This
can be computed by the parent after a round trip to the child and back, but the
value of φ(e) might have changed by then.

The idea in the protocol is as follows. The parent first starts an “observe”
phase when it observes φ(e). It does not change φ(e) during the observe phase.
Since the child never changes φ(e) anyway, φ(e) remains unchanged when the
parent is in the observe phase. The parent follows it up with a “correct” phase
during which it corrects the edge if it was observed to be illegal.

The corrective actions are one of the following. We reemphasize that these
change φ(e) but don’t change φ of any other edge of the spanning tree. Here a is
the parent and b is the child of e.

(1) If φ(e) is 0, inject a new find message onto e (without any change in p(a)),
increasing φ(e) to one.

(2) If φ(e) > 1, and p(a) = b, then reduce φ(e) by changing p(a) = a.

(3) If φ(e) > 1 but p(a) ≠ b, then there must be find messages in transit on
e. We show that eventually these find messages must reach a which can reduce
φ(e) by simply ignoring them.

It remains to be explained how the parent computes φ(e). At the start of the
observe phase, it sends out an observer message which makes a roundtrip to the
child and back. Since the edges are FIFO, by the time this returns to the parent,
the parent has effectively “seen” the number of finds in transit. The observer
has also observed p(b) on its way back to the parent. The parent computes φ(e)
by combining its local information with the information carried back by the
observer. Once the observer returns to the parent, the parent enters a correct
phase and the appropriate corrective action is taken.

To make the protocol self-stabilizing, we start an observe phase at the parent 
in response to a timeout and follow it up with a correct phase. The timeout is 
sufficient for two roundtrips from the parent to the child and back. If we have an 
observe phase followed by a “successful” correct phase, then the edge would be 
corrected, and would remain legal thereafter. Each observe phase has an “epoch 
number” to help the parent discard observers from older epochs, or maliciously 
introduced observers. 



6 Protocol Description 

In this section, we describe the protocol for a single edge e connecting nodes a 
and b where a is the parent node. 

States and Variables: 

Node a has the following variables.

(1) p(a) is a’s pointer (or arrow), pointing to a neighbor on the tree or to
itself. The rest are variables added for self-stabilization:

(2) state is boolean and is one of observe or correct.

(3) sent is an integer: the number of finds sent on e since the current
observe phase started.

(4) epoch is an integer which is the epoch number of the current observe
phase.

(5) v is an integer and is a’s estimate of φ(e) when it is in a correct state.

The only variable at b is the arrow, p(b).

Messages: There are two types of messages. One is the usual find message.
The other is the observer message, which a uses to observe φ(e). In response to a
timeout, a increments epoch and sends out message observer(epoch), indicating
the start of observe epoch epoch. Upon receipt of observer(c), b replies with
observer(c, p(b, e)).

Transitions: The transitions are of the form (event) followed by (actions).
A timeout event occurs when a’s timer exceeds twice the maximum roundtrip
delay from a to b and back. The timer is reset to zero after a timeout.

Transitions for a (the parent). 

— Event: Timeout

- Reset state to observe, sent to 0 and increment epoch {the epoch number}.

- Send observer(epoch) on e.

— Event: (state = observe) and (receive find from b)

- If (p(a) = a) then set p(a) ← b and the find is queued behind the last
request from a. If (p(a) ≠ a) and (p(a) ≠ b), then forward the find to p(a)
and set p(a) ← b. {the normal Arrow protocol actions.}

- If (p(a) = b), send the find back to b on e; increment sent.

— Event: (state = correct) and (receive find from b) {Eventually, state =
correct implies that v = φ(e)}

- If v > 1, then ignore the find; decrement v {since φ(e) has decreased}.

- Else (if p(a) = b) send the find back to b {this situation would not arise in
a legal execution}.

- Else, normal Arrow protocol actions.

— Event: (state = observe) and (receive observer(d, x) on e)

- If epoch ≠ d then ignore the message. {This observer is from an older epoch
or is spurious}.

- Else change state to correct, v = sent + x + p(a, e). {This is a’s estimate of
φ(e), and is eventually accurate}.

Take corrective actions (if possible).

- If v = 0 then send a find to b on e; increment v.

- If (v > 1 and p(a) = b), then p(a) ← a and decrement v.

— Event: (receive find from node u ≠ b on an adjacent edge) and (p(a) = b)

- normal Arrow protocol actions; increment sent {since a find will be sent
on e}.

Actions for b (the child). 

— Event: receive find from a.

- If p(b) = a, then send find back to a.

- Else, normal Arrow protocol actions.

— Event: receive observer(c). {a wants to know p(b, e).}

- Send observer(c, p(b, e)) on e.
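The transitions above can be exercised on a single edge with a small simulation. This is a sketch under strong simplifying assumptions that are not part of the paper: the tree consists only of the parent a and the child b, time advances in synchronous rounds in which every message in a channel is delivered in FIFO order, a times out every 4 rounds, no new queuing requests are issued, and corrective action (1) is read as a sending a find to b. The function and variable names are hypothetical.

```python
import random
from collections import deque

def phi(p_a, p_b, ab, ba):
    finds = sum(m[0] == "find" for m in ab) + sum(m[0] == "find" for m in ba)
    return int(p_a == "b") + int(p_b == "a") + finds

def simulate(seed, steps=40, period=4):
    rng = random.Random(seed)
    # Arbitrary initial state: arrows, parent variables and channel contents.
    p_a, p_b = rng.choice("ab"), rng.choice("ab")
    state, sent = rng.choice(["observe", "correct"]), rng.randrange(4)
    epoch, v = rng.randrange(4), rng.randrange(4)
    ab = deque(rng.choice([("find",), ("observer", rng.randrange(6))])
               for _ in range(rng.randrange(4)))
    ba = deque(rng.choice([("find",), ("observer", rng.randrange(6), rng.randrange(2))])
               for _ in range(rng.randrange(4)))
    history = []
    for t in range(steps):
        if t % period == 0:                        # timeout at the parent a
            state, sent, epoch = "observe", 0, epoch + 1
            ab.append(("observer", epoch))
        inbox, ba = list(ba), deque()
        for msg in inbox:                          # parent a handles arrivals from b
            if msg[0] == "find":
                if state == "correct" and v > 1:
                    v -= 1                         # ignore the find
                elif p_a == "b":
                    ab.append(("find",))           # send the find back on e
                    sent += 1                      # only read during the observe phase
                else:
                    p_a = "b"                      # normal Arrow action: queue at a
            elif state == "observe" and msg[1] == epoch:
                state, v = "correct", sent + msg[2] + int(p_a == "b")
                if v == 0:
                    ab.append(("find",))           # inject a find onto e
                    v += 1
                elif v > 1 and p_a == "b":
                    p_a = "a"
                    v -= 1
        inbox, ab = list(ab), deque()
        for msg in inbox:                          # child b handles arrivals from a
            if msg[0] == "find":
                if p_b == "a":
                    ba.append(("find",))           # send the find back to a
                else:
                    p_b = "a"                      # normal Arrow action: queue at b
            else:
                ba.append(("observer", msg[1], int(p_b == "a")))
        history.append(phi(p_a, p_b, ab, ba))      # phi(e) at the end of round t
    return history

print(simulate(1))
```

In this model any spurious observers are consumed during the first period, so the observation started at the second timeout is accurate and φ(e) settles at 1 shortly afterwards.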
7 Correctness Proof

We prove two properties for every edge. The first is closure: if an edge enters
a legitimate state then it remains in one. The second is stabilization: each edge
eventually enters a legitimate state. We prove these properties with respect to a
stronger predicate than φ(e) = 1, since in addition to the arrows and find
messages being legal, we will need to include the legality of the variables
introduced for self-stabilization.

Each observer has a count (or sequence number) . sent is a counter at a which 
is reset to zero at the beginning of every observe phase and incremented every 
time a sends out a find message on e. The current epoch number at a is epoch. 

We visualize the edge as a directed cycle (see Fig. 5), with the link from a
to b forming one half of the circumference and the link from b to a the other
half. The position of the observer(s), the find messages in transit, and the
nodes a and b are all points on this cycle. Messages travel clockwise on this
cycle and no message can “overtake” another (FIFO links). Let R denote the
maximum roundtrip time of a message from a to b and back. Node a times out and
starts a new observe phase after time 2R.

Fig. 5. The edge is a cycle. Messages find1 and find2 belong to F_ai and the other
three finds to F_ia



Suppose there was only one “current observer”, i.e. an observer whose count
matched the current epoch number (epoch) stored at a. Let F denote all the find
messages in transit on e. We can divide F into two subsets: F_ai, the find
messages between a and the current observer, and F_ia, the find messages between
the current observer and a. Clearly, |F_ai| + |F_ia| + p(a, e) + p(b, e) = φ(e).
If the observer is between b and a, it contains the value of p(b, e) as observed
when it passed b. We denote this value by p_obs(b, e).

Predicate 1. The current observer is between a and b, and φ(e) = sent +
p(a, e) + p(b, e) + |F_ia|.






Predicate 2. The current observer is between b and a, and 
$\phi(e) = sent + p(a,e) + p_{obs}(b,e) + |F_{ia}|$. 



Lemma 3. Suppose a was in an observe phase and there was only one current 
observer. If predicate 1 is true to start with, and a does not time out until the 
observer returns, then the following will be true of the observer’s trip back 
to a. 

(1) When the observer is between a and b, predicate 1 will remain true. 

(2) After the observer crosses b and is between b and a, predicate 2 will be true. 

(3) When the observer returns to a, a enters a correct state where $v = \phi(e)$. 

Proof. Until the observer returns to a, a will remain in an observe phase, and 
$\phi(e)$ will not change. Recall that $\phi(e) = |F| + p(a,e) + p(b,e)$. 

As long as the observer hasn’t reached b, every find in $F_{ai}$ must have been 
injected by a after the observer left a, and thus $|F_{ai}| = sent$. We have 
$\phi(e) = p(a,e) + p(b,e) + |F_{ia}| + |F_{ai}| = p(a,e) + p(b,e) + |F_{ia}| + sent$, 
satisfying predicate 1. This proves part (1). 

Suppose the observer is just about to cross b. We have 
$\phi(e) = sent + p(a,e) + p(b,e) + |F_{ia}|$. Immediately after the observer crosses b, 
we have $p_{obs}(b,e) = p(b,e)$. Since $\phi(e)$ has not changed and none of the other 
quantities $p(a,e)$, $|F_{ia}|$ have changed in the meanwhile (think of the observer 
crossing b as an atomic operation), 
$\phi(e) = sent + p(a,e) + p_{obs}(b,e) + |F_{ia}|$ after the observer has crossed, 
and predicate 2 is true. 

We now prove by induction over the size of $|F_{ia}|$ that predicate 2 continues 
to hold. Suppose it was true when $|F_{ia}|$ was k. If $|F_{ia}|$ decreases to $k - 1$, then 
a find must have been delivered to a. If $p(a,e)$ was 1, then the find would have 
bounced back on e and sent would have increased by 1. If $p(a,e)$ was zero (a 
was pointing away from b), then $p(a,e)$ would increase to 1 and sent would 
remain the same. In either case, the sum $sent + p(a,e)$ would increase by 1, and 
$|F_{ia}| + sent + p(a,e)$ would remain unchanged. This proves part (2). 

When the observer reaches a again, $F_{ia}$ will be the empty set and we thus 
have $\phi(e) = sent + p(a,e) + p_{obs}(b,e)$. Once the observer reaches a, a will enter 
a correct state and set v to the above quantity ($sent + p(a,e) + p_{obs}(b,e)$). This 
proves part (3). □ 
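
The inductive step for part (2) can be checked mechanically. The following is only a minimal sketch, not part of the protocol: the function names and the integer encoding of the state (a counter for the finds still travelling from the observer towards a) are assumptions made for illustration.

```python
def deliver_find_at_a(sent, p_a, f_ia):
    """Deliver one find from the observer-to-a half of the cycle to node a.

    sent : finds a has sent on e in the current observe phase
    p_a  : p(a, e), 1 if a's arrow points to b, else 0
    f_ia : number of finds in transit between the observer and a
    (hypothetical encoding, chosen only for this sketch)
    """
    if f_ia <= 0:
        raise ValueError("no find in transit towards a")
    f_ia -= 1
    if p_a == 1:
        sent += 1   # the find bounces back onto e
    else:
        p_a = 1     # a's arrow now points to b; nothing is sent on e
    return sent, p_a, f_ia


def predicate_2_sum(sent, p_a, p_obs_b, f_ia):
    """Right-hand side of predicate 2."""
    return sent + p_a + p_obs_b + f_ia
```

In both cases the delivery removes one find from transit and adds one to sent + p(a, e), so the value returned by predicate_2_sum is the same before and after the call.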

We will now define the set of legitimate states for an edge, this time including 
the variables introduced for self-stabilization as well. 

- $R_1$ denotes the predicate: $\phi(e) = 1$. 

- We denote the AND of the following predicates by $R_2$. 

(1) a’s state is observe 

(2) there is exactly one current observer 

(3) the other observers have counts less than epoch (the current epoch number 
at a) 

(4) predicate 1 or predicate 2 is satisfied 

- We denote the AND of the following predicates by $R_3$. 

(1) a’s state is correct 

(2) there is no observer with a count greater than or equal to epoch 

(3) $v = \phi(e)$ (i.e., a knows $\phi(e)$) 

Since a can be in either the observe state or in the correct state, but never 
both at the same time, only one of $R_2$ or $R_3$ can be true at a time. 

Definition 3. The edge is in a legitimate state iff the following is true: 
$R_1 \wedge (R_2 \vee R_3)$. 

To prove self-stabilization of the protocol, we first prove stabilization and 
closure for the predicate $R_2 \vee R_3$ and then for the predicate $R_1 \wedge (R_2 \vee R_3)$. This 
technique has been called a convergence stair in [7]. 

Lemma 4. If $R_2 \vee R_3$ is true, then it will continue to remain true. 

Proof. We consider two cases and all the possible actions that could occur in 
each case. 

(1) $R_3$ is true. As long as a is in the correct state, it will not introduce 
any new observers. $v = \phi(e)$ to start with; any changes to $\phi(e)$ are made at a 
and are also reflected in v, thus v will remain equal to $\phi(e)$. Suppose a times 
out and enters an observe state: it increments epoch and sends out an observer 
with sequence number epoch. There is only one current observer and it satisfies 
predicate 1 trivially. $R_2$ is true now and so is $R_2 \vee R_3$. 

(2) $R_2$ is true. If the observer does not reach a and a does not time out, then 
the observer remains on the edge e and predicate 1 or predicate 2 will continue 
to hold due to lemma 3. If a times out before the observer reaches it, then it 
will enter an observe state, epoch is incremented, and a new observer is injected 
into the channel with sequence number epoch. $R_2$ is still true (only one current 
observer; no observers with sequence number greater than epoch; predicate 1 is 
true). If the observer reaches a before it times out, then a will go to a correct 
state and by lemma 3, $v = \phi(e)$ at a. Thus $R_3$ is true at a. □ 

Lemma 5. Closure: If $R_1 \wedge (R_2 \vee R_3)$ is true, then it will continue to remain 
true. 

Proof. From lemma 4 we know that $R_2 \vee R_3$ will continue to remain true. 

We have to show that $R_1$ will continue to hold. Since $R_1$ is true initially, we 
have $\phi(e) = 1$ to start with. If we can show that $\phi(e)$ is never changed, then we 
are done. 

If a is in the observe state ($R_2$ is true), then $\phi(e)$ is never changed. If a is in 
the correct state ($R_3$ is true), we have $v = \phi(e) = 1$ (by predicate $R_3$). If $v = 1$, 
then a will not take any corrective action, and thus $\phi(e)$ is never changed. □ 

Definition 4. A state is fresh if a has just timed out, and thus the next time 
out is 2R time steps away. It is half-fresh if the next time out is at least R time 
steps away. 
Lemma 6. Within 3R time of any state, we will reach a state where $R_2$ is true 
and the state is fresh. 

Proof. If there are any observers in transit with sequence numbers greater than 
epoch, then they will reach a within time R (the roundtrip time). All the observers 
that a injects have sequence numbers less than or equal to epoch. Clearly, 
after time R there will never be an observer with sequence number greater than 
epoch. 

Within time 2R after that, a will time out, enter an observe state, increment 
epoch and send out a new observer, resetting sent to zero. Clearly $R_2$ is now true; 
there is only one current observer, the other observers have sequence numbers 
less than epoch, and predicate 1 is true. Since a has just timed out, the state is 
fresh. □ 



Lemma 7. Starting from any state, within 4R time we will reach a state where 
$R_3$ is true and the state is half-fresh. 

Proof. From lemma 6 we know that $R_2$ will be true within time 3R. Thus a is in 
an observe state and its current observer obeys predicate 1 or 2. If the observer 
reaches a before a times out (going into an observe state of a later epoch), then 
a will enter a correct state. And by lemma 3, $v = \phi(e)$. Thus $R_3$ will be true at a. 

Since the state we started out from is fresh, the next time out will occur only after 
time 2R. Since R is the maximum roundtrip time, the observer will indeed reach 
a within time R (before the timeout), and the next time out is at least R away. 
Thus the state is half-fresh. □ 



Lemma 8. Stabilization: In time 5R we will reach a state where $R_1 \wedge (R_2 \vee R_3)$ 
is true. 

Proof. From lemma 7, within time 4R we are in a state where $R_3$ (and thus 
$R_2 \vee R_3$) is true and the state is half-fresh. From lemma 4 we know that $R_2 \vee R_3$ will 
remain true after that. We now show that within R more time steps, predicate 
$R_1$ will also be true. 

If $R_3$ is true then $v = \phi(e)$. If $\phi(e) = 1$, then $R_1$ is already true. If $\phi(e) = 0$, 
then a will increase $\phi(e)$ immediately and $R_1$ will be true. If $\phi(e) > 1$ and 
$p(a) = b$, then a reduces $\phi$ by setting $p(a) = a$. 

We are now left with the case when $\phi(e) > 1$ and $p(a) \neq b$. Since 
$\phi(e) = p(a,e) + p(b,e) + |F|$ (where F is the set of all find messages in transit), and 
$p(a,e) = 0$, it must be true that $|F| > 0$. Let $\phi_c$ be the current value of $\phi(e)$. We 
prove that within R time steps, at least $\phi_c - 1$ find messages must arrive at a on 
e. Since the timeout is at least R time away (the state is half-fresh), a remains 
in a correct state for at least time R and by ignoring all those find messages, it 
reduces $\phi(e)$ to 1, thus satisfying $R_1$. 

We now show that at least $\phi_c - 1$ find messages arrive at a within the next 
R time steps. We use proof by contradiction. Let the current state of the system 
be start, the state of the system after R time steps be finish and the state of 
the system after the maximum latency between a and b be middle. The time 
interval between middle and finish is more than the maximum latency for the 
link between b and a, since R is greater than the sum of the maximum $a \rightarrow b$ 
and $b \rightarrow a$ latencies. 

Suppose fewer than $\phi_c - 1$ messages arrived at a in time R. This means that 
all the while from start to (and including) finish, $v = \phi(e) > 1$ and $p(a) \neq b$. 

Thus, a never forwarded any finds onto e after start. After middle, and until 
finish, there will be no finds in transit from a to b, since all the find messages 
that were in transit at start would have reached b. We have two cases. 

If $p(b,e) = 0$ in middle, then b will not forward any more finds on e till finish. 
By the time we reached finish, all finds that were in transit from b to a in middle 
would have reached a. At finish, there are no find messages in transit on e (a 
did not send any after start and neither did b after middle) and $p(a,e) = 0$ and 
$p(b,e) = 0$, and thus $\phi(e) = 0$ in finish, which is a contradiction. 

If $p(b,e) = 1$ in middle, then b might forward a find on e between middle 
and finish and this would change $p(b,e)$ to 0. But since no finds are forthcoming 
from a, this is the last find that would arrive from b before finish. The rest of 
the finds would have reached a by finish and thus at finish, the value of $\phi(e)$ is 
at most 1 ($\le 1$ finds in transit, $p(a,e) = 0$ and $p(b,e) = 0$), which is again a 
contradiction. □ 

Stabilization time: From lemma 8, each edge will stabilize to a legitimate 
state in 5R time steps, where R is the maximum roundtrip time for that edge. 
The protocol stabilizes when the last edge has stabilized. Thus the stabilization 
time of the protocol is $5R_{max}$, where $R_{max}$ is the maximum of the roundtrip 
times of all the edges. 
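
The way the bound is assembled from the lemmas can be written out as a short sketch. The function names and the list of per-edge roundtrip times are hypothetical and serve only to illustrate the arithmetic.

```python
def edge_stabilization_bound(roundtrip):
    """Upper bound on one edge's stabilization time, following Lemmas 6 to 8."""
    to_fresh_observe = 3 * roundtrip   # Lemma 6: R_2 holds and the state is fresh
    to_correct = roundtrip             # Lemma 7: R_3 holds, half-fresh (4R in total)
    to_phi_one = roundtrip             # Lemma 8: phi(e) = 1 as well (5R in total)
    return to_fresh_observe + to_correct + to_phi_one


def protocol_stabilization_bound(roundtrips):
    """Bound for the whole spanning tree: the slowest edge dominates."""
    return max(edge_stabilization_bound(r) for r in roundtrips)
```

Because every edge stabilizes independently, the bound for the protocol is simply the largest per-edge bound.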

8 Conclusions 

We have presented a self-stabilizing Arrow queuing protocol. This was possible 
because of a decomposition of the global predicate defining “legality” of a proto- 
col state into the conjunction of a number of purely local predicates, one for each 
edge of the spanning tree. The delay needed to self-stabilize the Arrow protocol 
differs from the delay needed to self-stabilize a rooted spanning tree by only a 
constant number of round trip delays on an edge. 

Acknowledgments. The second author is grateful to Steve Reiss for helpful 
discussions and ideas. 

References 

[1] S. Aggarwal and S. Kutten. Time optimal self-stabilizing spanning tree algorithm. 
In FSTTCS93 Proceedings of the 13th Conference on Foundations of Software 
Technology and Theoretical Computer Science, Springer-Verlag LNCS:761, pages 
400-410, 1993. 






[2] G. Antonoiu and P. Srimani. Distributed self-stabilizing algorithm for minimum 
spanning tree construction. In Euro-par’97 Parallel Processing, Proceedings 
LNCS:1300, pages 480-487. Springer-Verlag, 1997. 

[3] B. Awerbuch, B. Patt-Shamir, and G. Varghese. Self-stabilization by local check- 
ing and correction. In FOCS91 Proceedings of the 31st Annual IEEE Symposium 
on Foundations of Computer Science, pages 268-277, 1991. 

[4] M. Demmer and M. Herlihy. The arrow directory protocol. In Proceedings of 12th 
International Symposium on Distributed Computing, Sept. 1998. 

[5] E. Dijkstra. Self stabilizing systems in spite of distributed control. Communica- 
tions of the ACM, 17:643-644, 1974. 

[6] S. Dolev, A. Israeli, and S. Moran. Self-stabilization of dynamic systems assuming 
only read/write atomicity. Distributed Computing, 7:3-16, 1993. 

[7] M. Gouda and N. Multari. Stabilizing communication protocols. IEEE Transac- 
tions on Computers, 40:448-458, 1991. 

[8] M. Herlihy. The aleph toolkit: Support for scalable distributed shared objects. In 
Workshop on Communication, Architecture, and Applications for Network-based 
Parallel Computing (CANPC), January 1999. 

[9] M. Herlihy, S. Tirthapura, and R. Wattenhofer. Competitive concurrent distributed 
queuing. In Proceedings of the ACM Symposium on Principles of Distributed 
Computing (to appear), August 2001. 

[10] M. Herlihy, S. Tirthapura, and R. Wattenhofer. Ordered multicast and distributed 
swap. Operating Systems Review, 35(1):85-96, January 2001. 

[11] M. Herlihy and M. Warres. A tale of two directories: implementing distributed 
shared objects in Java. Concurrency - Practice and Experience, 12(7):555-572, 
2000. 

[12] K. Raymond. A tree-based algorithm for distributed mutual exclusion. ACM 
Transactions on Computer Systems, 7(1):61-77, 1989. 

[13] M. Schneider. Self-stabilization. ACM Computing Surveys, 25:45-67, 1993. 

[14] G. Varghese. Self-stabilization by counter flushing. In Proceedings of the 
Thirteenth Annual ACM Symposium on Principles of Distributed Computing, pages 
244-253, 1994. 

[15] G. Varghese, A. Arora, and M. Gouda. Self-stabilization by tree correction. 
Chicago Journal of Theoretical Computer Science, (3):1-32, 1997. 




A Space Optimal, Deterministic, Self-Stabilizing, 
Leader Election Algorithm for Unidirectional 

Rings 



Faith E. Fich¹ and Colette Johnen² 

¹ Department of Computer Science, University of Toronto, Canada 
fich@cs.toronto.edu 

² Laboratoire de Recherche en Informatique, CNRS-Université de Paris-Sud, France 
colette@lri.fr 



Abstract. A new, self-stabilizing algorithm for electing a leader on a 
unidirectional ring of prime size is presented for the composite atom- 
icity model with a centralized daemon. Its space complexity is optimal 
to within a small additive constant number of bits per processor, signif- 
icantly improving previous self-stabilizing algorithms for this problem. 
In other models or when the ring size is composite, no deterministic 
solutions exist, because it is impossible to break symmetry. 



1 Introduction 

Electing a leader on a ring is a well studied problem in the theory of distributed 
computing, with recent textbooks devoting entire chapters to it. It requires 
exactly one processor in the ring be chosen as a leader. More formally, there 
is a distinguished subset of possible processor states in which a processor is 
considered to be a leader. The state of the processor that is chosen leader reaches 
and then remains within the subset, whereas the states of all other processors 
remain outside the subset. A related problem of interest is token circulation, 
where a single token moves around the ring from processor to processor, with at 
most one processor having the token in any configuration. The formal definition 
of a processor having a token is that the state of the processor belongs to a 
subset of distinguished states. 

Self-stabilizing algorithms are those which eventually achieve a desired prop- 
erty (for example, having a unique leader) no matter which configuration they 
are started in (and, hence, after transient faults occur). Dijkstra introduced the 
concept of self-stabilization and gave a number of self-stabilizing algorithms for 
token circulation on a bidirectional ring, assuming the existence of a leader. 
One of these algorithms uses only 3 states per processor. On a unidirectional 
ring with a leader, Gouda and Haddix can perform self-stabilizing 
token circulation using 8 states per processor. 

Conversely, given a self-stabilizing token circulation algorithm on a ring of 
size n, there is an easy self-stabilizing algorithm to elect a leader using only an 
additional $\lceil \log_2 n \rceil$ bits per processor. Specifically, each processor stores a name 
from $\{0, \ldots, n-1\}$ as part of its state. Whenever a processor gets the token, it 
updates its name by adding 1 to the name of its left neighbour and then taking 
the result modulo n. Eventually, there will be a unique processor with name 0, 
which is the leader. 
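
The renaming rule can be illustrated with a small simulation. This is a sketch under a simplifying assumption: the underlying token circulation is taken to have already stabilized, so a single token moves one processor to the right at each step. The function names and the list representation of the ring are hypothetical.

```python
def run_naming(names, token_at, steps):
    """Circulate a single token rightwards for `steps` moves.

    names[i] is the name of processor i, and processor (i - 1) % n is its
    left neighbour. Each time a processor receives the token it renames itself.
    """
    n = len(names)
    names = list(names)
    for _ in range(steps):
        token_at = (token_at + 1) % n
        # names[token_at - 1] is the left neighbour (index -1 wraps to n - 1)
        names[token_at] = (names[token_at - 1] + 1) % n
    return names


def leaders(names):
    """Indices of the processors whose name is 0."""
    return [i for i, name in enumerate(names) if name == 0]
```

After one full lap of the token, every processor's name is one more (modulo n) than its left neighbour's, so the names are a rotation of 0, ..., n − 1 and exactly one of them is 0. Further laps leave the names unchanged.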

Without the ability to break symmetry, deterministic self-stabilizing leader 
election and token circulation are impossible. For example, consider a 
synchronous system of anonymous processors. If all processors start in the same 
state with the same environment, they will always remain in the same state 
as one another. Similarly, in an asynchronous shared memory system of anony- 
mous processors with atomic reads and writes, where all registers have the same 
initial contents, or in an asynchronous message passing system of anonymous 
processors, where all communication links contain the same nonempty sequence 
of messages, many schedules, for example, a round robin schedule, will maintain 
symmetry among all the processors. Therefore, the study of deterministic algo- 
rithms for leader election and token passing in systems of anonymous processors 
has focussed on Dijkstra’s composite atomicity model with a centralized daemon 
(where a step consists of a state transition by a single processor, based on its 
state and the states of its neighbours). Even in this model, symmetry among 
equally spaced, nonadjacent processors in a ring can be maintained by an 
adversarial scheduler. Therefore, deterministic algorithms for leader election and 
token circulation are possible in a ring of n anonymous processors only when n 
is prime. 

Randomization is a well known technique to break symmetry and randomized 
algorithms for both problems have been considered on a variety of models. 
This work is beyond the scope of our paper. 

There are deterministic self-stabilizing leader election algorithms for 
bidirectional rings (of prime size using the composite atomicity model with a centralized 
daemon) that use only a constant amount of space per processor. For 
unidirectional rings, Burns and Pachl [7] presented a deterministic self-stabilizing 
token circulation algorithm that uses $O(n^2)$ states per processor, as well as a 
more complicated variant that uses $O(n^2/\log n)$ states per processor. They left 
the determination of the space complexity of this problem as an open question. 
Lin and Simon further improved their algorithm to $O(n\sqrt{n}/(\log n \log\log n))$ 
states per processor. 

Beauquier, Gradinariu, and Johnen proved a lower bound of n states 
per processor for any deterministic self-stabilizing leader election algorithm on 
a unidirectional ring of size n. They also mentioned a similar lower bound of 
$(n-1)/2$ states per processor for token circulation due to Jaap-Henk Hoepman. 

In Section 3, we present a deterministic self-stabilizing leader election 
algorithm for unidirectional rings (of prime size using the composite atomicity 
model with a centralized daemon) in which the number of states is $O(n)$. This 
matches the lower bound to within a small constant factor. Hence, our algorithm 
matches the number of bits of storage used at each processor to within a small 
additive constant of the number required by the lower bound. An algorithm 
for self-stabilizing token circulation on a unidirectional ring can be obtained by 
combining our algorithm for electing a leader with Gouda and Haddix’s token 
circulation algorithm that assumes the existence of a leader. The resulting 
algorithm has provably optimal space complexity to within a small additive 
constant, solving Burns and Pachl’s open question. 

Our algorithm was inspired by and is closely related to Burns and Pachl’s 
basic algorithm. To achieve small space, our idea is to time share the space: two 
pieces of information are stored alternately in one variable instead of in parallel 
using two different variables. However, the correct implementation of this simple 
idea in a self-stabilizing manner is non-trivial. 

When developing our algorithm, we use Beauquier, Gradinariu, and Johnen’s 
alternating schedule approach to simplify the description and the proof of 
correctness. In Section 2, we give a more careful description of our model of 
computation, define the set of alternating schedules, and state some important 
properties of executions that have alternating schedules. Most of these results 
describe how information flows from one processor to another during the course 
of an execution. 



2 The Model 



We consider a system consisting of n identical, anonymous processors arranged 
in a ring, where n is prime. The value of n is known to the processors. The 
left neighbour of a processor P will be denoted $P_L$ and its right neighbour 
will be denoted $P_R$. The ring is unidirectional, that is, each processor can only 
directly get information from its left neighbour. The distance from processor P 
to processor Q is measured starting from P and moving to the right until Q is 
reached. In particular, the distance from P to $P_L$ is $n - 1$. 

In any algorithm, each processor is in one of a finite number of states. A 
configuration specifies the state of every processor. An action of a processor is a 
state transition, where its next state depends on its current state and the state of 
its left neighbour. Note that the next state might be the same as the current state. 
Only one processor performs an action at a time. This is Dijkstra’s composite 
atomicity model with a centralized daemon. An algorithm is deterministic 
if, for each processor, its actions can be described by a total state transition 
function from the cross product of its state set and the state set of its left 
neighbour. In other words, in every configuration, each processor has exactly 
one action it can perform. A processor is enabled in a configuration if there is 
an action that causes the processor to change its state. Some authors prefer to 
describe a deterministic algorithm using partial state transition functions, not 
defining those transitions in which a processor is not enabled. 

A schedule is a sequence whose elements are chosen from the set of n processors. 
If P is the t’th element of the schedule, we say that t is a step of P and P 
takes a step at time t. A processor P takes a step during the time interval $[t, t'']$ 
if it takes a step at some time $t'$, where $t \le t' \le t''$. In particular, if $t > t''$, then 
the interval $[t, t'']$ is empty and no processor takes a step during this interval. 
An execution is an infinite sequence of configurations and processors 

config(0), proc(1), config(1), proc(2), config(2), . . . 

where configuration config(t) is obtained from configuration config(t − 1) by the 
action of processor proc(t), for all t > 0. We say that config(t) is the configuration 
at time t of the execution. The initial configuration of this execution is config(0), 
the configuration at time 0. The schedule of the execution is the subsequence 
proc(1), proc(2), . . . of processors. An infinite schedule or execution is fair if 
every processor appears in the sequence infinitely often. 

Fix an execution. If processor P takes a step at time T, then the state of 
P at time T is influenced by the state of $P_L$ at time $T - 1$. If P does not take 
any steps in the interval $[T + 1, t']$, then it will have the same state at T and 
$t'$. Similarly, if $P_L$ does not take any steps in the interval $[t + 1, T]$, then it will 
have the same state at t and $T - 1$. Thus the state of $P_L$ at time t influences the 
state of P at time $t'$. The following definition extends this relationship to pairs 
of processors that are further apart. 

Definition 1. Suppose $P_0, P_1, \ldots, P_k$ are $k + 1 \le n$ consecutive processors, in 
order, rightwards along the ring. Then the state of $P_0$ at time $t_0$ influences the 
state of $P_k$ at time $t' \ge t_0$, denoted 

$(P_0, t_0) \rightarrow (P_k, t')$, 

if and only if there exist times $t_0 < t_1 < \cdots < t_k \le t'$ such that $P_k$ takes no 
steps during the time interval $[t_k + 1, t']$ and, for $i = 1, \ldots, k$, $P_{i-1}$ takes no steps 
during the time interval $[t_{i-1} + 1, t_i]$, but $P_i$ takes a step at time $t_i$. 
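
On a finite prefix of a schedule, the influence relation can be decided by a direct search for the times $t_1, \ldots, t_k$. The sketch below is illustrative only: the function name, the 1-indexed time convention (schedule[t - 1] is the processor that steps at time t) and the explicit chain argument are assumptions, not notation from the paper.

```python
def influences(schedule, chain, t0, t_end):
    """Decide whether (chain[0], t0) influences (chain[-1], t_end).

    schedule : list of processors; schedule[t - 1] steps at time t
    chain    : [P0, ..., Pk], consecutive processors rightwards along the ring
    """
    feasible = {t0}  # candidate values for t_i, starting with t_0
    for prev, cur in zip(chain, chain[1:]):
        nxt = set()
        for s in feasible:
            for t in range(s + 1, t_end + 1):
                if schedule[t - 1] == prev:
                    break          # prev stepped again, so its state at s is gone
                if schedule[t - 1] == cur:
                    nxt.add(t)     # cur reads prev's unchanged state at time t
        feasible = nxt
    last = chain[-1]
    # the last processor must not step again in [t_k + 1, t_end]
    return any(
        all(schedule[t - 1] != last for t in range(tk + 1, t_end + 1))
        for tk in feasible
    )
```

For example, with the round robin schedule [0, 1, 2, 0, 1, 2] on a ring of three processors, the state of processor 0 at time 1 influences processor 2 at time 3, but it no longer influences processor 1 at time 5, because processor 1 has stepped again by then.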

This definition of influence only captures the communication of information 
around the ring. It does not capture knowledge that a processor retains when it 
takes steps. For example, if P takes a step at time T, then $(P, T - 1) \not\rightarrow (P, T)$. 

The following results are easy consequences of the definition. 

Proposition 1. Suppose $P$, $P'$, and $P''$ are distinct processors with $P'$ on the 
path from $P$ to $P''$. If $(P, t) \rightarrow (P'', t'')$ then there exists $t < t' < t''$ such that 
$P'$ takes a step at time $t'$, $(P, t) \rightarrow (P', t')$, and $(P', t') \rightarrow (P'', t'')$. Conversely, 
if $(P, t) \rightarrow (P', t')$ and $(P', t') \rightarrow (P'', t'')$, then $(P, t) \rightarrow (P'', t'')$. 



Proposition 2. Suppose $P$ takes no steps in the interval $[t + 1, T]$ and $P'$ takes 
no steps in the interval $[t' + 1, T']$, where $t \le T$ and $t' \le T'$. Then the following 
are equivalent: $(P, t) \rightarrow (P', t')$, $(P, t) \rightarrow (P', T')$, $(P, T) \rightarrow (P', t')$, and 
$(P, T) \rightarrow (P', T')$. 

A subset of the configurations of an algorithm is closed if any action per- 
formed from a configuration in this set results in a configuration in this set. Let 
H be a predicate defined on configurations. An algorithm stabilizes to H under a 
set of schedules S if there is a closed set of configurations, L, all of which satisfy 
H, such that every execution whose schedule is in S contains a configuration 
that is in L. The configurations in L are called safe. When an execution reaches 
a safe configuration, we say that it has stabilized. The stabilization time of an 
algorithm is the maximum, over all executions in S, of the number of actions 
performed until the execution stabilizes. A self-stabilizing algorithm is silent if 
no processors are enabled in safe configurations. 

Let LE be the predicate, defined on configurations of an algorithm, that is 
true when exactly one processor is a leader (i.e. its state is in the specified set). 
An algorithm that stabilizes to LE under the set of all fair schedules is called 
a self-stabilizing leader election algorithm. Notice that, once an execution of a 
leader election algorithm stabilizes, the leader does not change. This is because 
processors change state one at a time, so between a configuration in which one 
processor is the only leader and a configuration in which another processor is 
the only leader, there must be an unsafe configuration. 

We present an algorithm in Section 3 that stabilizes to LE under the set 
of alternating schedules, defined in Section 2.1, using 5n states per processor. 
Beauquier, Gradinariu, and Johnen prove the following result about 
stabilization under the set of alternating schedules. 

Theorem 1. Any algorithm on a ring that stabilizes to predicate H under the 
set of alternating schedules can be converted into an algorithm that stabilizes to 
H under all fair schedules, using only double the number of states (i.e. only one 
additional bit of storage) at each processor. 

Applying their transformation to our algorithm gives a self-stabilizing leader 
election algorithm that uses 10n states per processor. 



2.1 Alternating Schedules 

A schedule is alternating if, between every two successive steps of each proces- 
sor, there is exactly one step of its left neighbour and exactly one step of its 
right neighbour. Any round robin schedule is alternating. For a ring of size 5 
with processors $P_1, P_2, P_3, P_4, P_5$ in order around the ring, the finite schedule 
$P_1, P_2, P_5, P_4, P_1, P_5, P_3, P_4, P_2, P_3, P_1, P_2, P_5, P_4, P_1, P_5, P_3, P_2$ is also an 
alternating schedule. It is equivalent to say that a schedule is alternating if, between 
every two steps of each processor, there is at least one step of each of its neigh- 
bours. 
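
The first form of the definition translates directly into a check on a finite schedule. The following is a minimal sketch; the function name and the 0-based processor indices are assumptions for illustration, and it presumes a ring of at least three processors so that the two neighbours are distinct.

```python
def is_alternating(schedule, n):
    """Check the 'exactly one step of each neighbour' condition on a finite schedule.

    schedule is a list of processor indices in range(n); processor (p - 1) % n
    is the left neighbour of p and (p + 1) % n is its right neighbour.
    Assumes n >= 3.
    """
    last_step = {}  # processor -> index of its most recent step
    for i, p in enumerate(schedule):
        if p in last_step:
            between = schedule[last_step[p] + 1:i]
            left, right = (p - 1) % n, (p + 1) % n
            if between.count(left) != 1 or between.count(right) != 1:
                return False
        last_step[p] = i
    return True
```

Any round robin schedule passes this check, while a schedule such as 0, 1, 0, 2 on a ring of size 3 fails it, because processor 0 steps twice without an intervening step of its left neighbour.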

The assumption of an alternating schedule allows us to determine more situ- 
ations where the state of a processor at one step influences the state of another 
processor at some later step. The proofs of the following lemmas are by induc- 
tion on the distance from P to Q and can be found in the complete paper. They 
will be used in the proof that our algorithm stabilizes to LE under the set of 
alternating schedules. 

The first result says that if a step of P influences a step of Q, then earlier or 
later steps of P influence correspondingly earlier or later steps of Q. 
Lemma 1. Consider an alternating schedule in which processor P takes k steps 
in the interval $[t + 1, T]$ and processor Q takes k steps in the interval $[t' + 1, T']$. 
Then $(P, t) \rightarrow (Q, t')$ if and only if $(P, T) \rightarrow (Q, T')$. 

Another important property is that each step of each processor will eventually 
influence some step of every other processor. 

Lemma 2. Let P and Q be distinct processors. Suppose P takes a step at time 
t of an alternating schedule. Then there exists a step t' > t of Q such that 
$(P, t) \rightarrow (Q, t')$. 

The next results bound when a processor will influence another processor, as 
a function of the distance from one to the other. 

Lemma 3. Let P and Q be distinct processors, where the distance from P to Q 
is k. If P takes at least $k + 1$ steps by time $t'$ in an alternating schedule, then 
there exists a time $t < t'$ such that $(P, t) \rightarrow (Q, t')$, P takes at most k steps in 
the interval $[t + 1, t']$, and P takes a step at time t. 

Lemma 4. Let P and Q be distinct processors, where the distance from P to Q 
is k. If Q takes at least k steps by time $t'$ in an alternating schedule, then there 
exists a time $t < t'$ such that $(P, t) \rightarrow (Q, t')$, Q takes at most k steps in the 
interval $[t, t']$, and either $t = 0$ or P takes a step at time t. 

Finally, the difference between the numbers of steps that have been taken by 
two processors can be bounded by the distance from one processor to the other. 

Lemma 5. Suppose the distance from P to Q is k and $(P, t) \rightarrow (Q, t')$. If Q 
takes at least m steps by time $t'$ in an alternating schedule, then P takes at least 
$m - k$ steps by time t. 



Lemma 6. Suppose the distance between P and Q is k. If P takes m steps by 
time t, then Q takes between m — k and m + k steps by time t. 

3 A New Leader Election Algorithm 

In this section, we present a deterministic leader election algorithm for a uni- 
directional ring that uses 5n states per processor and stabilizes under the set 
of alternating schedules within 0{n^) time. We begin by describing some of the 
ideas from Burns and Pachl’s token circulation algorithm [7] and how they are 
used in our leader election algorithm. 

In a token circulation algorithm, some, but not all, of the tokens must disap- 
pear when more than one token exists. Similarly, in a leader election algorithm, 
when a ring contains more than one leader, some, but not all, of the leaders 
must become nonleaders. Because the only direct flow of information is from a 
processor to the processor on its right, the first token or leader a processor can 
receive information from is the first one that is encountered travelling left from 
the processor. We call this the preceding token or leader. The following token 
or leader is the closest one to the processor’s right. We define the strength of a 
token or leader to be the distance to it from the preceding token or leader. If 
there is only one token or leader in a ring of size n, then its strength is n. 

In Burns and Pachl’s basic algorithm, each processor has two variables: one 
to store its distance from the preceding token (which, in the case of a processor 
with a token is the strength of that token) and the other to store the strength 
of the preceding token. The value of each variable is in $[1, n]$. Thus, the total 
number of states per processor is $O(n^2)$. 

A processor whose left neighbour has a token knows that it is at distance 1 
from the preceding token. Any other processor can determine its distance from 
the preceding token by adding 1 to the corresponding value of its left neighbour. 
Every processor can obtain the strength of the preceding token directly from its 
left neighbour: from the first variable, if its left neighbour has a token and from 
the second variable, if its left neighbour does not have a token. 

When a processor with a token learns that the preceding token is stronger, 
it destroys its own token. On a ring whose size is prime, the distances between 
successive tokens cannot be all identical. Thus, extra tokens will eventually dis- 
appear. 

In our algorithm, the state of each processor consists of two components: a
tag X ∈ {c, d, B, C, D} and a value v ∈ [1, n]. We say that a processor is a leader
if its tag is B, C, or D; otherwise it is called a nonleader.

The safe configurations of our algorithm each contain exactly one leader,
which is in state (D, n). In addition, the nonleader at distance i from the leader
is in state (d, i), for i = 1, ..., n − 1. An example of a safe configuration is
illustrated in Figure 1(a). It is easy to verify that no processors are enabled in
safe configurations. Hence, our algorithm is silent.

Actions of our algorithm are described by specifying the state (X, v) of a
processor P and the state (X_L, v_L) of its left neighbour P_L and then giving P’s
new state (X', v'). Such an action will be written (X_L, v_L) (X, v) ↦ (X', v').
Instead of presenting all the actions at once, we present them in small groups,
together with a brief discussion of their intended effect.

For a leader, v usually contains its strength, i.e. its distance from the preceding 
leader. Sometimes, nonleaders are used to determine the strength of a leader. In 
this case, the processor has tag d and v contains its distance from the preceding 
leader. A nonleader can also have tag c, indicating that v is conveying strength 
information from the preceding leader to the following leader. 

An example of an unsafe configuration of our algorithm is illustrated in Figure
1(b). Here P and P' are leaders with strength 4, P'' is a leader with strength 3,
and the other 8 processors are nonleaders.

Fig. 1. Safe and unsafe configurations of our algorithm

When the tag of a leader is B or D, this signals the sequence of nonleaders
to its right to compute their distance from the leader, as follows: The right
neighbour of the leader sets its value to 1 and each subsequent nonleader sets
its value to one more than the value of the nonleader to its left. They set their
tags to d, to relay the signal. Because the schedule is alternating, the signal is
guaranteed to reach the entire sequence of nonleaders to the right of the leader.

1. (X_L, v_L) (X, v) ↦ (d, 1) for X_L = B, D and X = c, d

2. (d, v_L) (X, v) ↦ (d, 1 + (v_L mod n)) for X = c, d and v_L ≠ n − 1

When the tag of a leader is C, this signals the nonleaders to its right to convey
the strength of this leader, by copying the value from their left neighbour and
setting their tag to c. For example, in Figure 1(b), if the processor at distance 2
from P'' takes the next step, its state will change from (d, 2) to (c, 3).

3. (X_L, v_L) (d, v) ↦ (c, v_L) for X_L = c, C

Our algorithm ensures that if a processor has tag C or c immediately before 
it takes a step, then it has neither tag immediately afterwards. This implies that 
processors with tag c will not perform either of the following two actions after 
taking their first step. When the tag of its left neighbour is C, processor P with 
tag c simply treats its left neighbour’s tag as if it were D and enters state (d, 1). 
However, when its left neighbour’s tag is c, it is possible that all processors have 
tag c. To ensure that the ring contains at least one leader, P becomes a leader. 

4. (C, v_L) (c, v) ↦ (d, 1)

5. (c, v_L) (c, v) ↦ (B, 1)

A processor in state (d, n − 1) suggests that its right neighbour is the only
leader. If that right neighbour is a nonleader, it can correct the problem by
becoming a leader. This avoids another situation where no leader may exist.

6. (d, n − 1) (X, v) ↦ (B, 1) for X = c, d

A leader with tag B is a beginner: it has performed n or fewer steps as a
leader. For beginners, the value v records the number of steps for which the
processor has been a leader, up to a maximum of n. Thus, when a nonleader
becomes a leader, it begins in state (B, 1). After a leader has performed n steps
as a beginner, it should get tag D and record its strength in v. But when its left
neighbour has tag c, the processor cannot determine its own strength. Therefore
it waits in state (B, n) until its next step. In the meantime, its left neighbour
will take exactly one step and, hence, will have a different tag. Thus, a processor
performs at most n + 1 consecutive steps as a beginner.



7. (X_L, v_L) (B, v) ↦ (B, v + 1) for X_L = B, c, d and v ≠ n

8. (d, v_L) (B, n) ↦ (D, 1 + (v_L mod n))

9. (B, v_L) (B, n) ↦ (D, 1)

10. (c, v_L) (B, n) ↦ (B, n)



When the system is in a configuration with multiple leaders, some of these
leaders have to be destroyed. If P is a leader whose left neighbour has tag C or
D, then P resigns its leadership by setting its state to (d, 1). However, if P’s left
neighbour has tag B, then P waits in state (D, 1). Provided P’s left neighbour
stays a leader long enough, P will be destroyed, too.

11. (X_L, v_L) (X, v) ↦ (d, 1) for X_L = C, D and X = B, C, D

12. (B, v_L) (X, v) ↦ (D, 1) for X = C, D



When P is a leader whose left neighbour, P_L, has tag c, then v_L contains
the strength of the preceding leader. In this case, P can compare its strength
against that of the preceding leader. If P is at least as strong, it remains a leader.
Otherwise, P resigns its leadership by setting its tag to d. However, the value v_L
does not provide information from which P can compute its distance from the
preceding leader. Consequently, P sets its value to n, to act as a place holder
until P’s next step, when P_L will have a different tag.

13. (c, v_L) (X, v) ↦ (D, v) for X = C, D and v ≥ v_L

14. (c, v_L) (X, v) ↦ (d, n) for X = C, D and v < v_L

A processor can enter state (d, n) only by resigning its leadership. When the 
left neighbour of a leader is in state (d, n), the leader cannot use vl to determine 
its strength. In this case, the leader leaves its value v unchanged. 

15. (d, n) (D, v) ↦ (D, v)



When its left neighbour P_L has tag d, a leader P that is not a beginner
can update its value with a better estimate of its strength. Such a processor
usually alternates its tag between C and D. The only exception is when P’s left
neighbour has value n − 1, which indicates that P is the only leader. In this case,
P stays in state (D, n).

16. (d, v_L) (D, v) ↦ (C, 1 + v_L) for v_L ≠ n − 1, n

17. (d, v_L) (C, v) ↦ (D, 1 + (v_L mod n))

18. (d, n − 1) (D, v) ↦ (D, n)
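
The eighteen actions can be read as a single transition function of a processor's own state and its left neighbour's state. The following Python sketch is one such reading, reconstructed from the descriptions in this section; the names `step`, `safe_configuration` and `is_silent`, and the representation of a state as a `(tag, value)` tuple, are assumptions made for illustration. It is a sketch for experimenting with small rings, not a reference implementation.

```python
def step(left, me, n):
    """Return the next state of a processor, given its left neighbour's state.

    States are (tag, value) pairs with tag in {"c", "d", "B", "C", "D"}
    and value in 1..n. Comments give the action number being applied.
    """
    (xl, vl), (x, v) = left, me
    if x in ("c", "d"):                          # nonleader
        if xl in ("B", "D"):
            return ("d", 1)                      # action 1
        if xl == "d":
            if vl == n - 1:
                return ("B", 1)                  # action 6
            return ("d", 1 + (vl % n))           # action 2
        if x == "d":                             # left tag is c or C
            return ("c", vl)                     # action 3
        return ("d", 1) if xl == "C" else ("B", 1)   # actions 4 and 5
    if xl in ("C", "D"):                         # leader with a leader to its left
        return ("d", 1)                          # action 11
    if x == "B":                                 # beginner
        if v != n:
            return ("B", v + 1)                  # action 7
        if xl == "d":
            return ("D", 1 + (vl % n))           # action 8
        if xl == "B":
            return ("D", 1)                      # action 9
        return ("B", n)                          # action 10 (left tag is c)
    if xl == "B":                                # tag C or D from here on
        return ("D", 1)                          # action 12
    if xl == "c":
        return ("D", v) if v >= vl else ("d", n)  # actions 13 and 14
    if x == "C":                                 # left tag is d
        return ("D", 1 + (vl % n))               # action 17
    if vl == n:
        return ("D", v)                          # action 15
    if vl == n - 1:
        return ("D", n)                          # action 18
    return ("C", 1 + vl)                         # action 16


def safe_configuration(n, leader=0):
    """Leader in state (D, n); the nonleader at distance i is in state (d, i)."""
    cfg = [None] * n
    cfg[leader] = ("D", n)
    for i in range(1, n):
        cfg[(leader + i) % n] = ("d", i)
    return cfg


def is_silent(cfg):
    """True when no processor would change state by taking a step."""
    n = len(cfg)
    return all(step(cfg[i - 1], cfg[i], n) == cfg[i] for i in range(n))


if __name__ == "__main__":
    print(is_silent(safe_configuration(11)))
```

Checking `is_silent` on a safe configuration mirrors the observation that no processor is enabled there: every action that applies leaves the state unchanged.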
4 Properties of the Algorithm

In this section, we present a number of results about the behaviour of our
algorithm. They are useful for the proof of correctness presented in Section 5.
Throughout this section and Section 5, we assume that all schedules are
alternating.

The first result relates the value of a processor with tag d to its distance from 
a leader or a processor that has just resigned its leadership. It can be proved by 
induction. 

Lemma 7. Suppose (P, t) → (Q, t'), the distance from P to Q is k, and Q has
tag d at time t'. If, at time t, P is a leader or has state (d, n), then, at time t',
either Q has value n or value at most k. Conversely, if Q has value k at time
t', then, at time t, either P is a leader or has state (d, n).

A leader that is not a beginner cannot have become a leader recently. 

Lemma 8. Let [t, t'] be an interval that contains at most n steps of processor
P. If P has tag C or D at time t', then P is a leader throughout the interval
[t, t'].

Proof. Suppose that P becomes a leader during the interval [t + 1, t']. At that
time, P has state (B, 1). It cannot get tag C or D until it has performed at least
n more steps, which occurs after time t'.

The next result identifies a situation in which a leader cannot be created. 

Lemma 9. Suppose (P, t) → (Q, t') and Q takes at least one step before t'. If
P is a leader throughout the interval [t, t' − 1], then Q does not become a leader
at time t'.

Proof sketch. Suppose Q becomes a leader at time t'. It follows that Q_L has state
(d, n − 1) at time t' − 1. Then Lemma 7 is applied to obtain a contradiction.

4.1 Experienced Leaders 

An experienced leader is a processor that has been a leader long enough so that 
the fact that it is a leader influences and has been influenced by every other 
processor. Formally, an experienced leader is a processor with tag C or D that 
has taken at least n steps. Then, either an experienced leader has remained a 
leader since the initial configuration, or it has served its full time as a beginner, 
since last becoming a leader. 

The existence of an experienced leader in a configuration of an execution 
provides a lot of information about what actions may occur. 



Lemma 10. No new leader will be created whenever there is an experienced 
leader. 
Proof. To obtain a contradiction, suppose there is a time t'' at which P is an
experienced leader and Q becomes a leader. Since P has taken at least n steps,
it follows by Lemma 6 that Q has taken at least one step before time t''. This
implies that, at time t'', Q performs action 6, Q_L has state (d, n − 1), and
P ≠ Q, Q_L.

From Lemma 0 and Proposition 0 there are times t < t' < t'' such that
(P, t) → (Q_L, t') → (Q, t''), Q_L takes a step at time t', and P takes at most n
steps in the interval [t, t'']. Then Lemma 8 implies that P is a leader throughout
this interval. Since the distance from P to Q_L is at most n − 2, it follows from
Lemma 7 that Q_L cannot have value n − 1 at time t'. However, Q_L has state
(d, n − 1) at time t'' and takes no steps in the interval [t' + 1, t'']. Thus Q_L has
state (d, n − 1) at time t'. This is a contradiction.

The proofs of the next two results appear in the full paper. They use Propo- 
sition 121 and Lemmas □ 1111111311 and 0 

Lemma 11. If an experienced leader has value v > 1, then the v − 1 processors
to its left are nonleaders.

This says that the value of an experienced leader is a lower bound on its 
strength. 

Lemma 12. While a processor is an experienced leader, its value never de- 
creases. 

5 Proof of Self-Stabilization 

Here, we prove that the algorithm presented in Section 3 stabilizes to LE under
the set of alternating schedules.

The proof has the following main steps. First, we show that every execution
reaches a configuration in which there is a leader. Then, from some point on, all
configurations will contain an experienced leader. By Lemma 10, no new leaders
will be created, so, eventually, all leaders will be experienced leaders. As in Burns
and Pachl’s algorithm, if there is more than one leader, they cannot have the
same strength, because the ring size is prime. Thus resignations must take place
until only one leader remains. Finally, a safe configuration is reached.

Lemma 13. Consider any time interval [t, t'] during which each processor takes
at least n steps. Then there is a time in [t − 1, t'] at which some processor is a
leader.

Proof. Without loss of generality, we may assume that t = 1. To obtain a
contradiction, suppose that no processor is a leader in [0, t']. Only actions 2 and
3 are performed during [1, t'], since all other actions either require or create a
leader.

Let v be the maximum value that any processor has as a result of performing
action 2 for its first time. Then 1 ≤ v < n. Say processor P_0 has value v at time
t_0 as a result of performing action 2 for its first time. Let P_i be the processor
at distance i from P_0, for i = 1, ..., n − v. Then by Lemma |3 there exist times
t_1 < ··· < t_{n−v} such that (P_0, t_0) → (P_1, t_1) → ··· → (P_{n−v}, t_{n−v}) and P_i takes
a step at time t_i for i = 1, ..., n − v. Note that P_{n−v} takes at most n steps by
time t_{n−v}; otherwise, Lemma 5 implies that P_0 takes at least two steps before
t_0. This is impossible, since P_0 cannot perform action 3 twice in a row. Hence,
t_{n−v} ≤ t'.

It follows by induction that processor P_{n−1−v} has state (d, n − 1) at time
t_{n−1−v}. But then P_{n−v} performs action 6 at time t_{n−v} and becomes a leader.
This is a contradiction.

Lemma 14. If there is only one experienced leader and it has taken at least 3n
steps, then it cannot resign its leadership.

Proof sketch. To obtain a contradiction, suppose that processor P_0 has taken at
least 3n steps before time t_0, it is the only experienced leader at time t_0 − 1, and
it resigns its leadership at time t_0. Then P_0 performs action 14 at time t_0. Let
v_0 denote the value of P_0 at time t_0 − 1 and let k_0 = 0.

We prove, by induction, that there exist processors P_1, ..., P_{n−1}, values
v_0 < v_1 < ··· < v_{n−1}, distances 1 < k_1 < ··· < k_{n−1} < n, and times
t_1, t'_1, ..., t_{n−1}, t'_{n−1} such that, for i = 1, ..., n − 1,

- P_i performs action 14 at time t_i < t_0,

- P_i takes a step at time t'_i < t_i,

- the distance from P_i to P_0 is k_i,

- P_i has value v_i at time t_i − 1, and

But this is impossible.

Lemma 15. After each processor has taken 6n + 1 steps, there is always an 
experienced leader. 

Proof. Consider any execution of the algorithm. Let t be the first time at which
every processor has taken at least 3n steps and let t'' > t be the first time such
that all processors have taken at least n steps in [t + 1, t'']. By Lemma 13, there
is a time t' ∈ [t, t''] at which some processor P is a leader. If P has tag C or
D at time t', then P is an experienced leader. Otherwise, P has state (B, v) for
some value v ≥ 1. Unless an experienced leader is created, P will perform action
7 at each step until it has state (B, n), it will perform action 10 at most once,
and then perform action 8 or 9 to become an experienced leader. Note that, at
t', every processor has taken at least 3n steps, so Lemma 14 implies that there
is at least one experienced leader in every subsequent configuration.

By time t, some processor has taken exactly 3n steps, so Lemma 6 implies
that no processor has taken more than ⌊7n/2⌋ steps. Similarly, some processor
takes exactly n steps in the interval [t + 1, t''], so no processor takes more than
⌊3n/2⌋ steps in this interval. This implies that P takes at most 5n steps by time
t''. Therefore P becomes an experienced leader within 6n + 1 steps, since P takes
at most n + 1 steps after t' until this happens.
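
The step counts in this argument can be checked line by line. The calculation below is only a sketch: it assumes that the distance between two processors on a ring of size n is at most ⌊n/2⌋, which appears to be how the bound relating the step counts of two processors is applied here.

```latex
\begin{align*}
\text{steps of any processor by time } t
  &\le 3n + \lfloor n/2 \rfloor = \lfloor 7n/2 \rfloor,\\
\text{steps of any processor in } [t+1,\, t'']
  &\le n + \lfloor n/2 \rfloor = \lfloor 3n/2 \rfloor,\\
\text{steps of } P \text{ by time } t''
  &\le \lfloor 7n/2 \rfloor + \lfloor 3n/2 \rfloor \le 5n,\\
\text{steps of } P \text{ until it is experienced}
  &\le 5n + (n + 1) = 6n + 1.
\end{align*}
```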
Lemma 16. After each processor has taken at least ⌊15n/2⌋ + 2 steps, all leaders
are experienced.

Proof. Consider any execution of the algorithm and let t be the first time at
which every processor has taken at least 6n + 1 steps. By Lemma 15, the execution
contains an experienced leader at all times from t on. Lemma 10 implies that
no new leader will be created after time t. Any beginner at t will become an
experienced leader or resign its leadership by the time it has taken n + 1 more
steps. By time t, some processor has taken exactly 6n + 1 steps, so Lemma 6
implies that no processor has taken more than ⌊13n/2⌋ steps. Thus, by the time
each processor has taken ⌊15n/2⌋ steps, all leaders are experienced.



Lemma 17. Consider any interval [t, t'] during which the set of leaders does
not change, all leaders are experienced, and every processor performs at least
n + 1 steps. Then, at t', the value of every leader is equal to its strength.

Proof. Let Q be a leader with value v at t' and let P be the processor such that
the distance from P to Q is v. Suppose T' is the last time at or before t' at
which Q takes a step and Q_L has tag d. Processor Q_L has value v − 1 at T'. By
Lemma0 there is a time T < T' such that (P, T) → (Q_L, T'), P takes at most
v − 1 steps in [T + 1, T'], and P takes a step at time T. Lemma 7 implies that,
at time T, either P is a leader or has state (d, n).

If P has state (d, n) at time T, then P must have performed action 14 at time
T, resigning its leadership. But t < T < T' ≤ t' and the set of leaders doesn’t
change throughout the interval [t, t']. Thus P is a leader at time T and, hence,
at time t'.

By Lemma 11, the v − 1 processors to the left of Q are nonleaders at t'.
Hence, at t', processor P is the leader preceding Q and Q has strength v.



Lemma 18. Let t be any time at which there is more than one leader and all
leaders are experienced. If every processor takes at least ⌊5n/2⌋ + 2 steps in [t, t''],
then some processor resigns during [t + 1, t''].

Proof. To obtain a contradiction, suppose that, during [t, t''], all processors take
at least ⌊5n/2⌋ + 2 steps and the set of leaders does not change. Let t' > t be the
first time such that every processor performs at least n + 1 steps in [t, t']. Then
Lemma 17 implies that, throughout [t', t''], the value of every leader is equal to
its strength. Let P be one leader, let P' be the following leader, and let v and v'
denote their respective strengths. The processor that takes step t' takes exactly
n steps during the interval [t, t' − 1]. Then Lemma 6 implies that P takes at
most ⌊3n/2⌋ steps during [t, t'].

Consider the first time T > t' at which P has tag C. Then P performs at most
2 steps in [t' + 1, T]. By Lemma 0 there exist times T < T_1 < ··· < T_{v'−1} < T'
such that (P, T) → (P_1, T_1) → ··· → (P_{v'−1}, T_{v'−1}) → (P', T'), where P_i is the
processor at distance i from P. At time T_i, processor P_i performs action 3 and
gets state (c, v), for i = 1, ..., v' − 1.

From Lemma 0 we know that P takes at most v' ≤ n − 1 steps in [T + 1, T'].
Hence P takes at most ⌊5n/2⌋ + 1 steps during [t, T']. Therefore T' < t''. Since
no processor resigns during [t + 1, t''], processor P' performs action 13 at time
T'. Therefore v' ≥ v.

Since P is an arbitrary leader, this implies that the strengths of all leaders
are the same. Hence, the ring size n is divisible by the number of leaders. This
is impossible, because n is prime and the number of leaders lies strictly between
1 and n.
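
The divisibility step can be spelled out as a short calculation. The symbols m (the number of leaders) and s_1, ..., s_m (their strengths) are introduced here only for illustration; the sketch relies on nothing beyond the definition of strength as the distance from the preceding leader.

```latex
\begin{align*}
s_1 + s_2 + \cdots + s_m &= n
  && \text{(the gaps between successive leaders cover the ring exactly once)}\\
s_1 = s_2 = \cdots = s_m = s &\;\Longrightarrow\; m \cdot s = n
  && \text{(so } m \text{ divides } n\text{)}\\
1 < m < n \text{ and } n \text{ prime} &\;\Longrightarrow\; m \nmid n
  && \text{(a contradiction)}
\end{align*}
```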



Lemma 19. From any configuration in which there is only one leader and that
leader is experienced, the algorithm in Section 3 reaches a safe configuration
within O(n^) steps.

Proof. Consider a time t at which there is only one leader P and suppose P
is experienced at t. By Lemmas 0 and 0 there exists a step t' of P_L such that
(P, t) → (P_L, t') and P takes at most n steps in [t, t']. If P_L has tag d at time t',
let T' = t'; otherwise, let T' be the time of P_L’s next step. Then P_L has tag d
at time T'. By LemmaQJ, there exists a time T such that (P, T) → (P_L, T') and,
by Lemma [3 P_L will have value n − 1 at time T'.

Processor P gets state (D, n) at its first step following T'. From then on, P
remains in state (D, n), performing only actions 18 and 13. LemmaQand an
easy induction on k show that if the distance from P to Q is k, P has state
(D, n) at time t'', and (P, t'') → (Q, T''), then Q has state (d, k) at time T''.
Thus, within O(n^) steps, a safe configuration is reached.

Our main result follows directly from these lemmas.

Theorem 2. The algorithm in Section 3 stabilizes to LE within O(n^) steps
under any alternating schedule.

6 Conclusion

We have presented a deterministic, self-stabilizing leader election algorithm for
unidirectional, prime size rings of identical processors, proved it correct, and
analyzed its complexity. The number of states used by this algorithm is 10n
per processor, matching the lower bound ^ to within a small constant factor.
Combined with Gouda and Haddix’s algorithm H3!, we get a deterministic,
self-stabilizing algorithm for token circulation on unidirectional prime size rings that
uses only a linear number of states per processor. This answers the open question
in [3 of determining the space complexity of self-stabilizing token circulation
on unidirectional rings. We believe our work provides new insight into the nature
of self-stabilization by more precisely delineating the boundary between what is
achievable and what is not.

Through our work with alternating schedules, we have improved our
understanding of how the state of one processor influences the states of other
processors. This enabled us to store two pieces of information alternately in a single
variable, yet have both pieces of information available when needed. This
technique may be useful for designing other space efficient self-stabilizing algorithms
on the ring and, more generally, on other network topologies.

Our algorithm is silent under an alternating schedule. However, when com- 
bined with the deterministic token algorithm, it is not silent: the deterministic 
tokens circulate forever. One remaining open question is whether there exists a 
silent, deterministic, self-stabilizing leader election algorithm on a unidirectional ring that
uses only a linear number of states per processor. 
Abstract. Distributed randomized algorithms, when they operate under
a memoryless scheduler, behave as finite Markov chains: the probability
at the n-th step to go from a configuration x to another one y is a
constant p that depends on x and y only. By Markov theory, we thus
know that, no matter where the algorithm starts, the probability for the
algorithm to be after n steps in a “recurrent” configuration tends to 1
as n tends to infinity. In terms of self-stabilization theory, this means
that the set $\mathcal{R}ec$ of recurrent configurations is included in the set
$\mathcal{L}$ of “legitimate” configurations. However, in the literature, the
convergence of self-stabilizing randomized algorithms is always proved in an
elementary way, without explicitly resorting to results of Markov theory. This
yields proofs that are longer and sometimes less formal than they could be. One
of our goals in this paper is to explain convergence results of randomized
distributed algorithms in terms of Markov chain theory.

Our method relies on the existence of a non-increasing measure $\varphi$ over
the configurations of the distributed system. Classically, this measure
counts the number of tokens of configurations. It also exploits a function
D that expresses some distance between tokens, for a fixed number k
of tokens. Our first result is to exhibit a sufficient condition Prop on $\varphi$
and D which guarantees that, for memoryless schedulers, every recurrent
configuration is legitimate. We extend this property Prop in order to
handle arbitrary schedulers, although they may induce non-Markov chain
behaviours. We then explain how Markov’s notion of “lumping” naturally
applies to measure D, and allows us to analyze the expected time of
convergence of self-stabilizing algorithms. The method is illustrated on
several examples of mutual exclusion algorithms (Herman, Israeli-Jalfon,
Kakugawa-Yamashita).



1 Introduction 

A randomized distributed system is a network of N finite-state processes whose
states are modified via randomized local actions. A process P is enabled when
its state and the states of its neighbours allow P to be the site of some action.
A scheduler is a mechanism which, at each step, selects a subset of enabled
processes: all the selected processes then execute synchronously an action and
change their state accordingly. A scheduler is memoryless when the choice of
the enabled processes depends only on the current network configuration x, i.e.
the current value of the N-tuple of process states. Distributed randomized
algorithms, when they operate under a memoryless scheduler, behave as finite
Markov chains: the probability at the n-th step to go from a configuration x to
another one y is a constant p that depends on x and y only. By Markov theory,
we thus know that, no matter where the algorithm starts, the probability for
the algorithm to be after n steps in a “recurrent” configuration tends to 1 as
n tends to infinity. In terms of self-stabilization theory m, this means that
the set $\mathcal{R}ec$ of recurrent configurations is included in the set $\mathcal{L}$ of “legitimate”
configurations. However, in the literature (see, e.g., jIDfl I IDH2pi:i) ) the
convergence of self-stabilizing randomized algorithms is often proved in an elementary
way, without explicitly resorting to results of Markov theory. This yields proofs
that are longer and sometimes less formal than they could be. One of our goals in
this paper is to explain convergence results of randomized distributed algorithms in
terms of Markov chain theory.

We focus here on $\varphi$-algorithms where $\varphi$ is a measure over configurations that
characterizes the set $\mathcal{L}$ of legitimate states, and never increases along any
computation. For example, for mutual exclusion algorithms, a typical measure $\varphi$ counts
the number of “tokens”, the legitimate configurations being those with only one
token. Our first contribution is to exhibit a general property of $\varphi$-algorithms,
called Prop, which guarantees that, under a memoryless scheduler, every recurrent
configuration is legitimate ($\mathcal{R}ec \subseteq \mathcal{L}$). We also show that this property Prop
extends naturally to guarantee the convergence of distributed algorithms under 
arbitrary schedulers, although these algorithms may not behave any longer as 
Markov chains. Finally we explain how to use the “lumping” method of Markov 
theory in order to derive a simpler distributed algorithm, which allows us to 
compute the expected time of convergence of the original algorithm. This gives 
us a formal justification for the method used, e.g., by Herman m- 

The plan of the paper is as follows. After some preliminaries (Section 2),
we define a sufficient property Prop that ensures convergence of distributed
algorithms under a memoryless scheduler (Section 3). Then Prop is extended in order
to treat arbitrary schedulers in Section 4. We explain how to use the lumping
method in order to analyze the time of convergence in Section 5. We conclude
in Section 6.

2 Preliminaries 

For the sake of simplicity we focus in this paper on the simple topology of linear 
networks, i.e. rings. We also assume that the communication between processes 
is done through the reading of neighbours’ state. 

2.1 Randomized Uniform Ring Systems 

The following material is borrowed from Uni (cf. P0). A randomized uniform 
ring system is a triple $R = (N, \to, Q)$ where N is the number of processes in the
system, $\to$ is a state transition algorithm, and Q is a finite set of process states.
The N processes $P_0, P_1, \ldots, P_{N-1}$ form a ring: the fact that there is an edge from
$P_{i-1}$ to $P_i$ means that $P_i$ can observe the state $q_{i-1}$ of $P_{i-1}$. Calculations on
indices i of processes are done modulo N. Let Q be the state set of $P_i$. The
system is uniform in the sense that $\to$ and Q are common to every process. A
configuration of R is an N-tuple of process states; if the current state of process
$P_i$ is $q_i \in Q$, then the configuration of the system is $x = (q_0, q_1, \ldots, q_{N-1})$. We
denote by X the set of all configurations, i.e., $X = Q^N$. The state transition
algorithm $\to$ is given as a set of guarded commands of the form:

IF <guard_1> THEN <command_1>

IF <guard_2> THEN <command_2>

...

IF <guard_m> THEN <command_m>

Here a guard is a predicate of the form $g(q_{i-1}, q_i)$. A command modifies the state
$q_i$ of process $P_i$. It is either a deterministic action of the form $q_i' = h(q_{i-1}, q_i)$,
or a probabilistic action of the form $q_i' = h^{\omega}(q_{i-1}, q_i)$ with probability $p^{\omega} > 0$
($\omega \in \{0, 1\}$ and $\sum_{\omega=0}^{1} p^{\omega} = 1$). A process $P_j$ is enabled if a guard
$g(q_{j-1}, q_j)$ is true. The set of indices of enabled processes of a configuration x
is denoted by $\mathcal{E}(x)$. If no process is enabled at configuration x ($\mathcal{E}(x) = \emptyset$), we
say that there is a deadlock. Otherwise, a scheduler A is a mechanism which
selects a nonempty subset S of $\mathcal{E}(x)$. A transition then leads from x to
$x' = (q'_0, \ldots, q'_{N-1})$, where

$$q'_j = \begin{cases} q_j & \text{if } j \notin S, \\ r_j & \text{if } j \in S. \end{cases}$$

Here $r_j$ is the result of executing command$_{k(j)}$, where k(j) is the smallest index
$\ell$ such that guard$_{\ell}$ holds for $q_{j-1}$ and $q_j$. More precisely: $r_j = h^{\omega}_{k(j)}(q_{j-1}, q_j)$
with probability $p^{\omega}_{k(j)}$ if command$_{k(j)}$ is probabilistic; $r_j = h_{k(j)}(q_{j-1}, q_j)$ with
probability 1 otherwise. Such a transition is written $x \xrightarrow{S} x'$, or sometimes more
simply $x \to x'$. The probability associated to this transition, written $p(x \xrightarrow{S} x')$,
is always positive. More precisely, $p(x \xrightarrow{S} x') = \prod_{j \in S^*} p^{\omega}_{k(j)}$ where $S^*$ is the
subset of indices j of S such that command$_{k(j)}$ is probabilistic. Applying $\to$ i
times is written $\to^i$. The reflexive transitive closure of $\to$ is denoted $\to^*$. If the
scheduler A always selects exactly one enabled process (i.e., |S| is always equal
to 1), then it is a central scheduler. Otherwise, it is a distributed scheduler.

Example 1. In Herman’s mutual exclusion algorithm m, the scheduler A is
distributed and “maximal”: at each step of computation $x \xrightarrow{S} y$, the set S of
selected processes is exactly the set $\mathcal{E}(x)$ of all the enabled processes.

The set of states is Q = {0, 1}, and the number of processes is odd. The
expression $\bar{q}$ means q + 1 where + is addition modulo 2. For each word
$u = q_1 q_2 \cdots q_k$, the expression $\bar{u}$ denotes $\bar{q}_1 \bar{q}_2 \cdots \bar{q}_k$. The transition
system $\to$ is defined by:

IF $q_i \neq q_{i-1}$ THEN $q_i' = \bar{q}_i$

IF $q_i = q_{i-1}$ THEN $q_i' = \bar{q}_i$ with probability 1/2
or $q_i$ with probability 1/2.

For every configuration x, all the letter positions are enabled. When a process
$P_j$ has a state equal to the state of its left neighbour, it is enabled for
a probabilistic action, and $j \in S^*$. For example, consider the configuration
x = 011101010. Here $S^*$ is the set of indices of the first, third and fourth letters,
each of which equals its left neighbour on the ring. We have
$x \to x'$ with probability $(1/2)^3$, where x' is the “opposite” string $\bar{x}$ = 100010101.

^1 One can also similarly model systems where $P_i$ observes not only the state of $P_{i-1}$
but also that of $P_{i+1}$; see the example of Israeli-Jalfon’s algorithm.
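
The example can be checked mechanically. The sketch below encodes the two guarded commands of Example 1 as read above (a process whose bit differs from its left neighbour's flips deterministically; a process whose bit equals it flips with probability 1/2) and enumerates one synchronous step under the maximal scheduler. The function names `tokens` and `successors` and the string encoding of configurations are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

FLIP = {"0": "1", "1": "0"}


def tokens(x):
    """Indices j whose bit equals the left neighbour's bit (the set S*).

    x[j - 1] wraps around for j = 0, so the string is read as a ring.
    """
    return [j for j in range(len(x)) if x[j] == x[j - 1]]


def successors(x):
    """Map each successor of x to its probability in one synchronous step."""
    tok = tokens(x)
    weight = Fraction(1, 2 ** len(tok))
    dist = {}
    # Every process in S* independently keeps or flips its bit;
    # every other process flips deterministically.
    for choice in product((False, True), repeat=len(tok)):
        kept = {j for j, keep in zip(tok, choice) if keep}
        y = "".join(x[j] if j in kept else FLIP[x[j]] for j in range(len(x)))
        dist[y] = dist.get(y, Fraction(0)) + weight
    return dist


if __name__ == "__main__":
    x = "011101010"
    print(tokens(x))
    print(successors(x)["100010101"])
```

For x = 011101010 the sketch reports three probabilistic positions and assigns probability 1/8 to the complemented string, in agreement with the example.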



Without loss of understanding, we will henceforth abbreviate $R = (N, \to, Q)$
as $\to$. We now arbitrarily fix a configuration $x_0$ as the initial configuration and
a scheduler A. A computation of $\to$ under A is a (possibly infinite) sequence of
configurations $x_0, x_1, \ldots$, such that $x_0$ is the initial configuration and
$x_i \xrightarrow{S_i} x_{i+1}$ for all $i \ge 0$, where $S_i$ is the set of processes selected by A at
the i-th step. Such a computation of $\to$ under A is not deterministic because of the
existence of probabilistic actions. The computation tree associated with $\to$ under A
starting from $x_0$, denoted by $T(A, x_0)$, is a rooted tree such that:

1. $x_0$ is the root,

2. every directed path starting from the root corresponds to a possible
computation of $\to$ under A, and

3. every vertex v is labeled with the probability $\pi(v)$ that $\to$ under A follows
the path connecting $x_0$ to v. (If the path connecting $x_0$ to v is of the form
$x_0 \xrightarrow{S_0} x_1 \xrightarrow{S_1} \cdots \xrightarrow{S_{k-1}} x_k$, then $\pi(v) = \prod_{j=0}^{k-1} p(x_j \xrightarrow{S_j} x_{j+1})$.)

2.2 $\varphi$-Algorithms

We consider now $\varphi$-algorithms, i.e., distributed algorithms for which there exists
a measure $\varphi$ over configurations that never increases, whatever the randomized
actions do. This situation is typical of mutual exclusion algorithms, where $\varphi$
counts the number of tokens ($\varphi$ never increases, as tokens are never created
during algorithm computations, and decreases when tokens collide). This is also
the case of the Israeli-Jalfon directing protocol (see PEI, Section 4.1) where $\varphi$ is the
number of non-directed edges, or the Dolev-Israeli-Moran leader election protocol
(see 0, Section 5.3) where $\varphi$ is the number of processes that hold 1 in their
leader variables. Formally, a measure $\varphi$ from X to $\mathbb{N}$ is non-increasing iff:
$x \xrightarrow{S} y$ implies $\varphi(y) \le \varphi(x)$, for all $x, y \in X$ and every subset S of enabled processes.

For a $\varphi$-algorithm, we define the set $\mathcal{L} \subseteq X$ of $\varphi$-legitimate configurations as:
$\mathcal{L} = \{x \in X \mid \varphi(x) \le c\}$ where $c$ is an integer constant. Since $\varphi$ is non-increasing,
it is easy to show that $\mathcal{L}$ is closed, i.e.: for any $\varphi$-legitimate configuration $x \in \mathcal{L}$,
any configuration $y \in X$ and any set $S$ of enabled processes, $x \xrightarrow{S} y$ implies
$y \in \mathcal{L}$.

We assume furthermore that there is no deadlock ($\forall x \in X\ \mathcal{E}(x) \neq \emptyset$).
In practice, for mutual exclusion algorithms, the no deadlock property often
means that there always exists at least one token ($\forall x\ \varphi(x) \ge 1$). The set $\mathcal{L}$
of $\varphi$-legitimate configurations then actually corresponds to configurations with
exactly one token (i.e.: $x \in \mathcal{L}$ iff $\varphi(x) = c = 1$).

Given a scheduler $A$, we are interested in proving the following convergence
property (see [?]): no matter which initial configuration one starts from, the
probability that $\to$ under $A$ reaches a $\varphi$-legitimate configuration in a finite
number of transitions is 1. Formally, let $T^{\mathcal{L}}$ be the tree constructed from $T(A, x_0)$
by cutting the edges going out of vertices corresponding to $\varphi$-legitimate
configurations of $\mathcal{L}$; let $Leaf(\mathcal{L})$ be the set of leaves of $T^{\mathcal{L}}$. The convergence condition
claims that for all $x_0 \in X$: $\sum_{v \in Leaf(\mathcal{L})} \pi(v) = 1$.

The expression $\sum_{v \in Leaf(\mathcal{L})} \pi(v)$ is the probability of convergence of $\to$ under $A$, i.e.,
the probability, starting from $x_0$, to reach a $\varphi$-legitimate configuration in a finite
number of transitions. It will be abbreviated as $Pr(x_0 \to_A^* \mathcal{L})$, or sometimes
more simply as $Pr(x_0 \to^* \mathcal{L})$.



3 Randomized Algorithms as Markov Chains 

We assume in this section that scheduler $A$ is given and is memoryless, i.e.:
for every sequence of transitions $x_0 \xrightarrow{S_0} x_1 \cdots \xrightarrow{S_{\ell-1}} x_\ell$, $A$ deterministically
selects a subset $S_\ell$ of enabled processes of $x_\ell$, which depends on $x_\ell$ only
(not on the previous transitions). Since there is no deadlock, $S_\ell$ is nonempty,
and the modification of letters at positions of $S_\ell$ randomly changes $x_\ell$ into
a set of possible configurations $x_{\ell+1}^1, x_{\ell+1}^2, \ldots$, with associated probabilities
$p(x_\ell \to x_{\ell+1}^1), p(x_\ell \to x_{\ell+1}^2), \ldots$ of sum 1. More precisely, given $x \in X$,
consider the set $C$ of all the couples $(y, p)$ of $X \times [0, 1]$ such that $x \to y$ under $A$
with probability $p$. Then, for all $x \in X$, $\sum_{(y,p) \in C} p = 1$. So the computation
behaves exactly as a Markov chain [?]. Therefore the classical Markov property
holds: whatever the initial configuration we start from, the probability to reach
a recurrent configuration in a finite number of steps is 1. We exploit this result
for proving the convergence of $\varphi$-algorithms towards $\mathcal{L}$, using the fact that,
under a certain condition, each recurrent configuration is $\varphi$-legitimate. Let us first
recall the notions of “recurrence”, “transience” and Markov basic theorem in
our context.

Definition 2. A configuration $x$ is transient iff $\exists y\ x \to^* y \land \neg(y \to^* x)$.

Definition 3. A configuration $x$ is recurrent iff $x$ is non transient, i.e.:
$\forall y\ x \to^* y \Rightarrow y \to^* x$.

The set of recurrent configurations is denoted $\mathcal{R}ec$.

Correspondingly, we define $T^{\mathcal{R}ec}$ as a tree constructed from $T(A, x_0)$ by
cutting the edges going out from vertices corresponding to recurrent
configurations of $\mathcal{R}ec$. Let $Leaf(\mathcal{R}ec)$ be the set of leaves of $T^{\mathcal{R}ec}$. The expression
$\sum_{v \in Leaf(\mathcal{R}ec)} \pi(v)$ is the probability, starting from $x_0$, to reach a recurrent
configuration in a finite number of transitions. It will be abbreviated as
$Pr(x_0 \to^* \mathcal{R}ec)$.

Theorem 4. (Markov) Given a scheduler $A$, we have: for every configuration
$x$, $Pr(x \to^* \mathcal{R}ec) = 1$.
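
Definitions 2 and 3 only depend on which transitions have positive probability, so recurrence can be decided by plain graph reachability on a finite chain. The Python sketch below illustrates this on a hypothetical three-state chain; the names `reachable`, `recurrent_states` and `toy` are ours and do not come from the paper.

```python
def reachable(succ, x):
    """States y with x ->* y (reflexive-transitive closure of one-step moves)."""
    seen, stack = {x}, [x]
    while stack:
        for y in succ[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def recurrent_states(succ):
    """x is recurrent iff every y reachable from x can reach x back (Definition 3)."""
    reach = {x: reachable(succ, x) for x in succ}
    return {x for x in succ if all(x in reach[y] for y in reach[x])}


# Hypothetical three-state chain: only positive-probability moves are listed.
toy = {"a": ["a", "b"], "b": ["c"], "c": ["b", "c"]}
print(sorted(recurrent_states(toy)))
```

In this toy chain the state `a` can leave towards `b` and never come back, so it is transient, while `b` and `c` form the recurrent set.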

We now assume given a non-increasing measure $\varphi$ on $X$, and exploit Markov’s
theorem.

Definition 5. A configuration $x$ is $\varphi$-transient iff $\exists y\ x \to^* y \land \varphi(y) < \varphi(x)$.

Lemma 6. For every configuration $x$, if $x$ is $\varphi$-transient, then $x$ is transient.

Proof. Suppose that $x$ is $\varphi$-transient. Therefore $x \to^* y \land \varphi(y) < \varphi(x)$ for some
$y$. Let us show that $x$ is transient by proving by reductio ad absurdum that
$\neg(y \to^* x)$. If $y \to^* x$, then $\varphi(x) \le \varphi(y)$ (since $\varphi$ is non-increasing), which
contradicts $\varphi(y) < \varphi(x)$. □

Theorem 7. Given a scheduler $A$ and a non-increasing measure $\varphi$, we have: if
each configuration $x$ with $\varphi(x) > c$ is $\varphi$-transient, then: $\forall x\ Pr(x \to^* \mathcal{L}) = 1$.

Proof. Any non $\varphi$-legitimate configuration $x$ is such that $\varphi(x) > c$. So, by
assumption, any non $\varphi$-legitimate configuration $x$ is $\varphi$-transient. Hence, by
Lemma 6, $x$ is transient, i.e., non recurrent. So: $\neg\mathcal{L} \subseteq \neg\mathcal{R}ec$. Hence: $\mathcal{R}ec \subseteq \mathcal{L}$.
Now, by Markov’s theorem, $Pr(x \to^* \mathcal{R}ec) = 1$. It follows: $Pr(x \to^* \mathcal{L}) = 1$. □

Let us now give a local condition, called “Prop”, ensuring that every non-legitimate
configuration is $\varphi$-transient. This condition is local in the sense that
it involves only the one-step reduction $\to$ (instead of $\to^*$).

We assume given a scheduler $A$ and a non-increasing measure $\varphi$. For every
value of ring length $N$, we assume given a measure $D$ from $X = Q^N$ to a finite
set $\Delta$ and an ordering $\ll$ over $\Delta$. We then define a binary relation $\triangleleft$ over $X$ as
follows:

$\forall x, y \in X \quad x \triangleleft y \Leftrightarrow \varphi(x) < \varphi(y) \lor D(x) \ll D(y)$.

Since $<$ and $\ll$ are orderings, $\triangleleft$ is itself an ordering over the finite set $X = Q^N$.

Local condition Prop is defined as:

$\forall x\ (\varphi(x) > c \Rightarrow \exists y\ (x \to y \land y \triangleleft x))$.

It says that, for each configuration $x$ (of $\varphi$-measure $> c$), there exists a
transition going from $x$ to a configuration $y$ smaller w.r.t. $\triangleleft$. As there is no infinite
decreasing sequence for $\triangleleft$ (since $\Delta$ is finite), Prop ensures the $\varphi$-transience of
each non-legitimate configuration. Therefore by Theorem 7 it follows:

Theorem 8. Given a scheduler $A$ and a non-increasing measure $\varphi$, we have:
if, for all $N$, there exist a measure $D$ and an ordering $\ll$ such that Prop:

$\forall x\ (\varphi(x) > c \Rightarrow \exists y\ (x \to y \land (\varphi(y) < \varphi(x) \lor D(y) \ll D(x))))$

holds, then: $\forall x\ Pr(x \to^* \mathcal{L}) = 1$.

Given $A$ and $\varphi$, the crucial part of the convergence proof consists now in
finding appropriate $D$ and $\ll$ that satisfy Prop. In practice, $D$ and $\ll$ will not
be defined specifically for each ring length $N$, but in a generic manner with $N$
as a parameter.

Example 9. We apply here Theorem 8 for showing that Herman’s algorithm is
convergent. It is easy to notice (see [?]) that, starting from a configuration
with an odd number of processes, $\to$ has no deadlock. Given a configuration
$x = q_0 q_1 \cdots q_{N-1}$, we say there is a token at process $i$ if $q_i = q_{i-1}$. The measure
$\varphi$ counts the number of tokens of a configuration $x$, i.e.:
$\varphi(x) = card(\{i \in \{0, \ldots, N-1\} \mid q_i = q_{i-1}\})$. It is also proved in [?] that $\varphi$ is non-increasing.

The set $\mathcal{L}$ of $\varphi$-legitimate configurations is defined as the set of configurations $x$
with at most one token ($\varphi(x) \le c = 1$).

For configurations $x$ with at least two tokens, we consider the same measure
$D$ as the one introduced by Herman [?], i.e. the minimal distance between
two consecutive tokens of $x$. The corresponding ordering $\ll$ then coincides with
$<$. For example, for $x = 010100101101010$, there are three tokens (at the first,
sixth and tenth letters), and the minimal distance between them is $D(x) = 4$. In the
following, we assume that $u$ and $v$ are strings of $Q^*$, and that $a$ and $b$ are distinct elements
of $Q$. If $x$ has a pair of adjacent tokens (i.e., $x$ is of the form $uaaav$), then
$D(x) = 1$. Otherwise, $x$ may be of the form $uaa(ba)^{n-1}baav$ for some $n > 0$,
with $D(x) = 2n + 1$ (odd case), or of the form $uaa(ba)^{n}bbv$ for some $n \ge 0$, with $D(x) = 2n + 2$
(even case).

Let us show that, for all $x$ with at least two tokens ($\varphi(x) > 1$), there exists
$y$ such that $x \to y$ with $\varphi(y) < \varphi(x)$ or $D(y) < D(x)$. There are three cases:

- If $D(x) = 1$, $x$ is of the form $uaaav$ and $x \to y = \bar{u}bab\bar{v}$ with $\varphi(y) = \varphi(x) - 2$.

- If $D(x) = 2n + 1$ with $n > 0$, $x$ is of the form $uaa(ba)^{n-1}baav$ and
$x \to y = \bar{u}ba(ab)^{n-1}abb\bar{v}$ with $D(y) = 2n < D(x)$.

- If $D(x) = 2n + 2$ with $n \ge 0$, $x$ is of the form $uaa(ba)^{n}bbv$ and
$x \to y = \bar{u}ba(ab)^{n}aa\bar{v}$ with $D(y) = 2n + 1 < D(x)$.

By Theorem 8 it follows that: $\forall x\ Pr(x \to^* \mathcal{L}) = 1$. This proof is much simpler
than the original one of [?]. Many other examples can be treated exactly along
these lines (e.g., Beauquier-Delaet [2], Flatebo-Datta [?]).
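
For small rings, condition Prop can also be confirmed by exhaustive search. The Python sketch below is an illustration under our own encoding (tuples of 0/1, all processes selected at each step, $c = 1$); the function names are ours. It recomputes the token count and the measure $D$ of the 15-letter example, then checks Prop on every configuration of odd rings of length 3, 5 and 7.

```python
from itertools import product


def token_positions(x):
    """Indices i with q_i = q_{i-1} on the ring (index -1 wraps around)."""
    return [i for i in range(len(x)) if x[i] == x[i - 1]]


def min_distance(x):
    """D(x): smallest cyclic gap between two consecutive tokens."""
    t, n = token_positions(x), len(x)
    return min((t[(j + 1) % len(t)] - t[j]) % n or n for j in range(len(t)))


def successors(x):
    """All configurations reachable in one synchronous step (each has probability > 0)."""
    toks = token_positions(x)
    for coins in product([False, True], repeat=len(toks)):
        keeps = {i for i, keep in zip(toks, coins) if keep}
        yield tuple(q if i in keeps else 1 - q for i, q in enumerate(x))


def prop_holds(n):
    """Check Prop with c = 1 on every configuration of a ring of length n."""
    for x in product((0, 1), repeat=n):
        k = len(token_positions(x))
        if k > 1 and not any(
            len(token_positions(y)) < k or min_distance(y) < min_distance(x)
            for y in successors(x)
        ):
            return False
    return True


example = tuple(int(ch) for ch in "010100101101010")
print(token_positions(example), min_distance(example))
print(all(prop_holds(n) for n in (3, 5, 7)))
```

The search is exponential in the ring length, so it only serves as a sanity check; the case analysis above is what establishes Prop for every $N$.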

4 Randomized Algorithms under Arbitrary Scheduler 

Let us consider now the case where the scheduler $A$ is not memoryless, but
has unlimited resources, and chooses the next enabled processes using the full
information of the execution so far. In such a case, for the same configuration
$x$, the selection of enabled processes can vary with the different sequences of
transitions that led to $x$. The computation under such a scheduler $A$ does not
behave any longer as a Markov chain. Property Prop of Theorem 8 (dependent
on a specific scheduler) must be strengthened in order to take into account all
possible schedulers: see Prop’ below. For the sake of simplicity, we focus on the
case where the scheduler is central (the selected subset of enabled processes is a
singleton). The counterpart of Theorem 8 is the following:
Theorem 10. Given a non-increasing measure $\varphi$, suppose that there exist $D$
and $\ll$ such that:

Prop’: $\forall x\ \forall i \in \mathcal{E}(x)\ (\varphi(x) > c \Rightarrow \exists y\ (x \xrightarrow{i} y \land (\varphi(y) < \varphi(x) \lor D(y) \ll D(x))))$.

Then, for any central scheduler $A$: $\forall x\ Pr(x \to_A^* \mathcal{L}) = 1$.

The proof is given in appendix A. The result extends to the distributed case
in the natural way (by replacing $\mathcal{E}(x)$ with the corresponding set of subsets of
enabled processes). Theorem 10 can be seen as
a restricted version of Theorem 1 of [?] (see also theorem 5 of [7]). In [?], we
prove the convergence of Kakugawa-Yamashita’s algorithm using Theorem 10.
For lack of space, we present below an application to a simpler algorithm.

Example 11. Let us consider Israeli-Jalfon’s algorithm [?]. The scheduler is
central and arbitrary. A minor difference with the framework presented in
Section 2 is that each guard takes into account not only the state $q_{i-1}$ of the
left neighbour of $P_i$, but also the state $q_{i+1}$ of the right neighbour. Also each
command modifies not only the state $q_i$ of $P_i$, but also the states of the left and
right neighbours. (There is no conflict between transitions because the scheduler
is central, and only one enabled process $P_i$ is selected at each step.) The set of
states is $Q = \{0, 1\}$. The transition algorithm $\to$ is:

IF $q_{i-1} q_i q_{i+1} = 111$ THEN $q'_{i-1} q'_i q'_{i+1} = 101$

IF $q_{i-1} q_i q_{i+1} = 011$ THEN $q'_{i-1} q'_i q'_{i+1} = 101$ with probability 1/2
or 001 with probability 1/2.

IF $q_{i-1} q_i q_{i+1} = 110$ THEN $q'_{i-1} q'_i q'_{i+1} = 101$ with probability 1/2
or 100 with probability 1/2.

IF $q_{i-1} q_i q_{i+1} = 010$ THEN $q'_{i-1} q'_i q'_{i+1} = 100$ with probability 1/2
or 001 with probability 1/2.

Given a configuration $x$ we say there is a token at process $i$ if $q_i = 1$. The
measure $\varphi$ counts the number of tokens of a configuration $x$. We focus on initial
configurations $x$ with at least one token ($\varphi(x) > 0$). Then all the subsequent
configurations always keep at least one token and $\to$ has no deadlock. It is also
obvious that $\varphi$ is non-increasing (no ‘1’ is created). The set $\mathcal{L}$ of $\varphi$-legitimate
configurations is defined as the set of configurations $x$ with at most one token
($\varphi(x) \le c = 1$).

Given a configuration with a fixed number of tokens, say $k > 1$, we
consider the measure $D$ that maps any configuration $x$ to the $k$-tuple of
distances between two consecutive tokens, sorted in increasing order. For example, for
$x = 000110001010$, $k = 4$ and $D(x) = (1, 2, 4, 5)$.

Let us show that, for all $x$ with at least two tokens ($\varphi(x) > 1$), and every
position $i$ of a token (or ‘1’) in $x$, there exists $y$ such that $x \xrightarrow{i} y$ with $\varphi(y) < \varphi(x)$
or $D(y) \ll D(x)$, where $\ll$ is the lexicographic order. There are four cases:

- If $q_{i-1} q_i q_{i+1} = 111$, then $x$ is of the form $u111v$ and $x \xrightarrow{i} y = u101v$ with
$\varphi(y) = \varphi(x) - 1$.

- If $q_{i-1} q_i q_{i+1} = 011$, then $x$ is of the form $u011v$ and $x \xrightarrow{i} y = u001v$ (with
probability 1/2) with $\varphi(y) = \varphi(x) - 1$.

- If $q_{i-1} q_i q_{i+1} = 110$, then $x$ is of the form $u110v$ and $x \xrightarrow{i} y = u100v$ (with
probability 1/2) with $\varphi(y) = \varphi(x) - 1$.

- If $q_{i-1} q_i q_{i+1} = 010$, then $x$ is of the form $u010v$. Let $\ell$ (resp. $r$) be the
distance between the token at position $i$ and the closest token at its left
(resp. right). Such a left (resp. right) closest token exists because $\varphi(x) > 1$.
(Note that the left and right closest tokens coincide if $\varphi(x) = 2$.) If $\ell < r$
then $x \xrightarrow{i} y = u100v$ (with probability 1/2) with $D(y) \ll D(x)$. Otherwise,
$x \xrightarrow{i} y = u001v$ (with probability 1/2) with $D(y) \ll D(x)$.

By Theorem 10 it follows that, for any scheduler $A$: $\forall x\ Pr(x \to_A^* \mathcal{L}) = 1$.
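
As with Herman’s algorithm, Prop’ can be confirmed by brute force on small rings. The sketch below is only an illustration with names of our own choosing: a configuration is a tuple of 0/1, a move of the token at position $i$ sends it to its left or right neighbour (merging with a token already there), and Python’s built-in tuple comparison plays the role of the lexicographic order $\ll$.

```python
from itertools import product


def token_positions(x):
    """Positions holding a token, i.e. indices i with q_i = 1."""
    return [i for i, q in enumerate(x) if q == 1]


def distances(x):
    """D(x): cyclic gaps between consecutive tokens, sorted increasingly."""
    t, n = token_positions(x), len(x)
    return tuple(sorted((t[(j + 1) % len(t)] - t[j]) % n or n for j in range(len(t))))


def moves(x, i):
    """The two outcomes (probability 1/2 each) of selecting the token at position i."""
    n = len(x)
    for j in ((i - 1) % n, (i + 1) % n):
        y = list(x)
        y[i] = 0
        y[j] = 1  # landing on an occupied position merges the two tokens
        yield tuple(y)


def prop_prime_holds(n):
    """Check Prop' with c = 1 for every configuration and every selected token."""
    for x in product((0, 1), repeat=n):
        k = len(token_positions(x))
        if k < 2:
            continue
        for i in token_positions(x):
            if not any(
                len(token_positions(y)) < k or distances(y) < distances(x)
                for y in moves(x, i)
            ):
                return False
    return True


example = tuple(int(ch) for ch in "000110001010")
print(distances(example))
print(all(prop_prime_holds(n) for n in range(3, 9)))
```

The check quantifies over every token position, which mirrors the universal quantification over the scheduler’s choice in Prop’.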

5 Expected Time of Convergence by D-Lumping

We assume again in this section that scheduler $A$ is given and is memoryless. So
the computation via $\to$ under $A$ behaves exactly as a Markov chain [?]. Let $k$
be an integer greater than $c$, $X_k$ the set of configurations of $\varphi$-measure $k$, and
$X_{<k}$ the set of configurations of measure less than $k$. Let $X_k^d$ be the subset of
configurations of $X_k$ of $D$-measure $d$, and $\Delta_k$ the image of $X_k$ via $D$. We have:

$X_k = \{x \in X \mid \varphi(x) = k\}$.

$X_{<k} = \{x \in X \mid \varphi(x) < k\}$.

$X_k^d = \{x \in X \mid \varphi(x) = k \land D(x) = d\}$.

$\Delta_k = \{d \in \Delta \mid \exists x \in X_k\ D(x) = d\}$.

Note that $X_{<k}$ is closed under $\to$ (since $\varphi$ is non-increasing). We assume:

$\forall x \in X_k\ \exists y \in X_{<k}\ x \to^* y$.

This means that for every configuration $x$ with $k$ tokens ($k > c$), there exists a
computation that goes from $x$ to a configuration with less than $k$ tokens.
Condition Prop of Theorem 8 is a sufficient condition that guarantees the existence
of such a computation. We want now to get quantitative information about the
expected time of convergence of $\to$ under $A$ from $x_0 \in X_k$ to $X_{<k}$. Formally,
given $x_0 \in X_k$, let $T^{X_{<k}}$ be the tree obtained from $T(A, x_0)$ by cutting the edges
going out from vertices corresponding to configurations of $X_{<k}$. Let $Leaf(X_{<k})$
be the set of leaves of $T^{X_{<k}}$. We are interested in computing (an upper bound
for) the expected time of $x_0$ to reach $X_{<k}$, that is $\sum_{v \in Leaf(X_{<k})} \delta(v)\pi(v)$,
where $\delta(v)$ and $\pi(v)$ are the depth and probability of $v$ in $T^{X_{<k}}$ respectively.

This will be abbreviated as $E(x_0 \to_A^* X_{<k})$, or more simply as $E(x_0 \to^* X_{<k})$.
We now explain how to compute this quantity, under certain conditions, by
“lumping” using measure $D$. In Markov theory, it is indeed common to lump
together configurations, in order to get a smaller Markov chain which gives
information about the original chain. Given $x \in X_k$ and $e \in \Delta_k$, consider the
expression:

$\xi(x, e) = \sum_{y \in X_k^e} p(x \to y)$.

This is the probability of moving via $\to$ under $A$ from an element $x$ of $X_k$ into
the set $X_k^e$ of configurations of $D$-measure $e$. Likewise, consider the expression:

$\xi(x, \bot) = \sum_{y \in X_{<k}} p(x \to y)$,

where $\bot$ is a new symbol. This is the probability of moving via $\to$ under $A$ from
an element $x$ of $X_k$ into the set $X_{<k}$ of configurations of $\varphi$-measure less than $k$.
Finally, let:

$\xi(\bot, \bot) = 1$.

This expresses the fact that $X_{<k}$ is a set which once entered is never left. We
say that the Markov chain $\to$ is D-lumpable if, for all $x \in X_k^d$ and all $e \in \Delta_k \cup \{\bot\}$,
probability $\xi(x, e)$ depends only on the $D$-measure $d$ of $x$, i.e.:

$\forall d \in \Delta_k\ \forall e \in \Delta_k \cup \{\bot\}\ \forall x, x' \in X_k^d \quad \xi(x, e) = \xi(x', e)$.

We then write such a probability $\xi(d, e)$. Given a $D$-lumpable transition system
$\to$, the associated D-transition system, written $\to_D$, is the binary relation over
$\Delta_k \cup \{\bot\}$ defined as follows:

- for all $d, e \in \Delta_k$, $d \to_D e$ with probability $\xi(d, e)$,

- for all $d \in \Delta_k$, $d \to_D \bot$ with probability $\xi(d, \bot)$,

- $\bot \to_D \bot$ with probability $\xi(\bot, \bot) = 1$.
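
The lumpability condition can be tested directly on a finite chain by aggregating each row over the classes of the measure and comparing the rows of states that share a class. The Python sketch below illustrates this on a hypothetical instance of our own: Israeli-Jalfon moves on a ring of length 5 with at most two tokens, under an assumed memoryless scheduler that always selects the token of smallest index. The names `step`, `measure` and `lump` are ours, and the string `"bot"` stands for the new symbol of the lumped chain.

```python
from fractions import Fraction
from itertools import combinations

N = 5  # ring length (hypothetical small instance)
HALF = Fraction(1, 2)


def step(tokens):
    """One move under a scheduler that always picks the smallest token index."""
    t = min(tokens)
    row = {}
    for j in ((t - 1) % N, (t + 1) % N):
        y = frozenset((tokens - {t}) | {j})  # landing on a token merges the two
        row[y] = row.get(y, 0) + HALF
    return row


def measure(tokens):
    """D restricted to k = 2: minimal cyclic distance, or 'bot' below two tokens."""
    if len(tokens) < 2:
        return "bot"
    i, j = sorted(tokens)
    return min(j - i, N - (j - i))


def lump(chain, measure):
    """Return the lumped transition table, or None if the chain is not lumpable."""
    lumped = {}
    for x, row in chain.items():
        agg = {}
        for y, p in row.items():
            agg[measure(y)] = agg.get(measure(y), 0) + p
        d = measure(x)
        if lumped.setdefault(d, agg) != agg:
            return None
    return lumped


states = [frozenset(c) for r in (1, 2) for c in combinations(range(N), r)]
chain = {s: step(s) for s in states}
print(lump(chain, measure))
```

Passing a measure that is not compatible with the dynamics, for instance the smallest token index, makes `lump` return `None`, which is the failure of the condition stated above.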

By Markov theory, if $\to$ is a lumpable Markov chain, then the lumped transition
system $\to_D$ is also a Markov chain. The D-computation tree associated with $\to_D$
starting from $d_0 \in \Delta_k \cup \{\bot\}$, denoted by $U(A, d_0)$, is a rooted tree such that:

1. $d_0$ is the root,

2. every directed path starting from the root corresponds to a possible sequence
of transitions via $\to_D$, and

3. every vertex $w$ is labeled with the probability $\psi(w)$ corresponding to the
path connecting $d_0$ to $w$. (If the path connecting $d_0$ to $w$ is of the form
$d_0 \to_D d_1 \to_D \cdots \to_D d_h$, then $\psi(w) = \prod_{j=0}^{h-1} \xi(d_j, d_{j+1})$.)

Let $U^{\bot}$ be the tree constructed from $U(A, d_0)$ by cutting the edges going
out from vertices corresponding to $\bot$. Let $Leaf(\bot)$ be the set of leaves of $U^{\bot}$.
Let $\psi(w)$ and $\epsilon(w)$ be the probability and the depth of $w$ respectively, for
every vertex $w$ in $U^{\bot}$. The expression $\sum_{w \in Leaf(\bot)} \epsilon(w)\psi(w)$ is the expected
number of $D$-transitions, starting from $d_0$, to reach $\bot$. It will be abbreviated
as $E_k(d_0 \to_D^* \bot)$. We now explain, by Markov theory, how to compute this
quantity and how it relates to $E(x_0 \to^* X_{<k})$.

The D-transition matrix is the square matrix of size $(|\Delta_k| + 1)$ having $\xi(d, e)$
on the row and column corresponding to $d$ and $e$ respectively, for all
$d, e \in \Delta_k \cup \{\bot\}$.

Definition 12. An absorbing element $a$ is an element such that, for all
$d \in \Delta_k \cup \{\bot\}$, $a \to_D d \Rightarrow d = a$. The set of absorbing elements is denoted by
$Abs$. A Markov chain is absorbing iff all recurrent elements are absorbing and
conversely ($\mathcal{R}ec = Abs$).

Note that, by definition, all absorbing elements are recurrent ($Abs \subseteq \mathcal{R}ec$).
Furthermore, in our case, it is clear that $\bot \in Abs$ (since $\bot \to_D \bot$ with probability
1). Besides, from the assumption that $\forall x \in X_k\ \exists y \in X_{<k}\ x \to^* y$, it follows
by lumpability that: $\forall d \in \Delta_k\ d \to_D^* \bot$ (convergence towards $\bot$). Markov theory
states the following (see [?], p.59):
Theorem 13. (Markov2) For every absorbing Markov chain, we have:
$E_k(d \to_D^* Abs) < \infty$. Furthermore, $(E_k(d \to_D^* Abs))_{d \in \Delta_k} = (I - Q_k)^{-1}\mathbf{1}$, where
$Q_k$ is the matrix obtained by truncating the $D$-transition matrix onto the set of
non-absorbing elements, $I$ the identity matrix of size $|\Delta_k|$, and $\mathbf{1}$ is the column
matrix made of $|\Delta_k|$ elements 1.

Besides, we have: 

Lemma 14. For the $D$-transition system, $Abs = \mathcal{R}ec = \{\bot\}$. So the
$D$-transition system is an absorbing Markov chain.

Proof. We have always $Abs \subseteq \mathcal{R}ec$. Let us prove $\mathcal{R}ec \subseteq Abs$ by showing that if
$d \notin Abs$, then $d \notin \mathcal{R}ec$. Suppose $d \notin Abs$. Then $d \neq \bot$. Hence $\neg(\bot \to_D^* d)$. On
the other hand, by convergence towards $\bot$, we have: $d \to_D^* \bot$. Since $d \to_D^* \bot$ and
$\neg(\bot \to_D^* d)$, $d \notin \mathcal{R}ec$. □

From Theorem 13 and Lemma 14 it follows:

Corollary 15. For all $d \in \Delta_k$, $E_k(d \to_D^* \bot) < \infty$. Furthermore,
$(E_k(d \to_D^* \bot))_{d \in \Delta_k} = (I - Q_k)^{-1}\mathbf{1}$, where $Q_k$ is the matrix obtained by truncating the
$D$-transition matrix onto $\Delta_k$ (i.e., removing the $\bot$-row and $\bot$-column).

In addition, since the original Markov chain $\to$ is $D$-lumpable, Markov theory
says that the lumped $D$-transition system $\to_D$ is such that (see [14]):

$\forall d \in \Delta_k\ \forall x \in X_k^d \quad E(x \to^* X_{<k}) = E_k(d \to_D^* \bot)$.

It is interesting to compute $E_k(d \to_D^* \bot)$ rather than $E(x \to^* X_{<k})$ because
the transition matrix of the lumped Markov chain is much smaller than the
matrix of the original chain. For example, in Herman’s example in case $k = 2$,
the matrix of the lumped chain is of size $\lfloor N/2 \rfloor$ while the original one is of size
$2^N$. The computation of $E_k(d \to_D^* \bot)$ in Herman’s example is explained below.



Example 16. Let us first show that the Markov chain corresponding to Herman’s
algorithm is $D$-lumpable for $k = 2$. This is due to the invariance by rotation of
the minimal distance of configurations. Formally let $x$ be a given configuration
with 2 tokens of minimal distance $D(x) = d$. Let $i$ be the position of the first
token, and $j$ the position of the second one. We write $x = (i, j)$. The distance
$d$ is $\min(i - j, j - i)$ where $-$ is subtraction modulo $N$, and ranges over
$\Delta_2 = \{1, 2, \ldots, \lfloor N/2 \rfloor\}$ (as $N$ is odd, $\lfloor N/2 \rfloor = (N - 1)/2$). For $1 < d < \lfloor N/2 \rfloor$, $x$
moves either to $x_0 = (i, j)$, $x_1 = (i + 1, j + 1)$, $x_2 = (i + 1, j)$ or $x_3 = (i, j + 1)$
with equal probability 1/4. Therefore, $\xi(x, d) = 1/2$ ($= p(x \to x_0) + p(x \to x_1)$),
and $\xi(x, d + 1) = \xi(x, d - 1) = 1/4$. If $d = 1$, then $\xi(x, 1) = 1/2$, $\xi(x, 2) = 1/4$ and
$\xi(x, \bot) = 1/4$ (case where the 2 tokens collide). If $d = (N - 1)/2$, $\xi(x, d) = 3/4$
and $\xi(x, d - 1) = 1/4$. Given $d$ and $e$, the value of each probability $\xi(x, e)$ is
constant, whatever the choice of $x \in X_2^d$. Hence the lumpability property.

Let us now explain the computation of the $D$-transition matrix $Q_k$ for $k = 2$
in Herman’s example [?]. $\Delta_2 = \{1, 2, \ldots, m\}$ with $m = \lfloor N/2 \rfloor$. $Q_2$ is the $m \times m$
matrix of components $\xi(d, e)$ (with $d, e \in \Delta_2$), of the form:

$$Q_2 = \begin{pmatrix}
1/2 & 1/4 & & & \\
1/4 & 1/2 & 1/4 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1/4 & 1/2 & 1/4 \\
 & & & 1/4 & 3/4
\end{pmatrix}$$

Note that component $\xi(1, \bot) = 1/4$ is excluded from the matrix $Q_2$, since the
$\bot$-column has been truncated. We then compute $B_2 = (I - Q_2)^{-1}$, which gives:

$$B_2 = \begin{pmatrix}
4 & 4 & 4 & \cdots & 4 \\
4 & 8 & 8 & \cdots & 8 \\
4 & 8 & 12 & \cdots & 12 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
4 & 8 & 12 & \cdots & 4m
\end{pmatrix}$$

By Corollary 15 we know that the result of applying $B_2$ to $\mathbf{1}$ gives a column
vector of $d$-component $2d(N - d)$, for $d \in \{1, \ldots, \lfloor N/2 \rfloor\}$. Therefore
$E_2(d \to_D^* \bot) = 2d(N - d)$. The maximal expected time corresponds to $d = \lfloor N/2 \rfloor = m$,
and is $2m(m + 1) \approx N^2/2$. This corresponds to $E(x \to^* \mathcal{L})$ for $x \in X_2$. We
thus retrieve directly what Herman obtained in [?] in a more complicated way.
Using this result, Herman explains how to infer an expected time $N^2 \lceil \log N \rceil / 2$
for $E(x \to^* \mathcal{L})$ in the general case where $x \in X_k$ with $k > 2$ (see [?]).
Israeli-Jalfon’s algorithm is analyzed similarly in appendix B.
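
The matrix computation of this example is easy to reproduce. The following Python sketch is an illustration with our own function names: it builds $Q_2$ for a given odd ring length, solves $(I - Q_2)\,t = \mathbf{1}$ by Gauss-Jordan elimination over exact fractions, and lets one compare the solution with the closed form $2d(N - d)$.

```python
from fractions import Fraction


def herman_q2(n_ring):
    """Lumped two-token matrix Q_2 for an odd ring length n_ring (m = n_ring // 2)."""
    m = n_ring // 2
    q = [[Fraction(0)] * m for _ in range(m)]
    for d in range(m):
        q[d][d] = Fraction(1, 2)
        if d > 0:
            q[d][d - 1] = Fraction(1, 4)
        if d < m - 1:
            q[d][d + 1] = Fraction(1, 4)
    q[m - 1][m - 1] = Fraction(3, 4)  # at distance m the distance cannot grow
    return q


def expected_times(q):
    """Solve (I - Q) t = 1 exactly by Gauss-Jordan elimination."""
    m = len(q)
    # Augmented matrix [I - Q | 1].
    a = [[(1 if r == c else 0) - q[r][c] for c in range(m)] + [Fraction(1)]
         for r in range(m)]
    for col in range(m):
        piv = next(r for r in range(col, m) if a[r][col] != 0)
        a[col], a[piv] = a[piv], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(m):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[r][m] for r in range(m)]


n_ring = 9
times = expected_times(herman_q2(n_ring))
print(times)
print([2 * d * (n_ring - d) for d in range(1, n_ring // 2 + 1)])
```

Solving the linear system avoids forming the inverse explicitly; the two printed lists can be compared entry by entry.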



6 Final Remarks and Further Work 

We exploited Markov chain theory in order to simplify the proofs of convergence
of randomized self-stabilizing algorithms and justify more formally their
performance analysis. Our method relies on the existence of a non-increasing measure
$\varphi$ over the configurations of the distributed system. Classically, this measure
counts the number of tokens of configurations. It also exploits a function $D$
that expresses some distance between tokens, for a fixed number $k$ of tokens.
Our first result was to exhibit a sufficient condition Prop that exploits $\varphi$ and $D$
in order to guarantee that, under a memoryless scheduler, every non-legitimate
configuration is “transient” in the sense of Markov theory. Roughly speaking,
Prop says that, under a given scheduler $A$, for every non-legitimate configuration $x$, there exists
a transition of non-null probability that applies to $x$ and decreases $\varphi$ or $D$. We
extended this property Prop in order to handle arbitrary schedulers although
they may induce non-Markovian behaviours. We thus retrieve (a particular
case of) a result due to [?]. We then explained how Markov’s notion of lumping
naturally applies to measure $D$, and allows one to analyze the expected time of
convergence of self-stabilizing algorithms. Let us point out that the crucial step
for proving convergence of such distributed algorithms is still the discovery of
an appropriate function $D$ satisfying Prop or Prop’ (which may be intricate,
see e.g., the example of Kakugawa-Yamashita in [?]), but our work identifies some
weak properties of $D$ which suffice to entail the algorithm convergence and
justify the performance analysis as done, e.g., by Herman. We only treated the
case of the finite-state reading model of communication as well as ring topologies.
The finite-state assumption is a basic requirement of our work, which seems
difficult to relax. On the other hand, we believe that our results apply
to network topologies other than rings. In the future, we plan to compare our
lumping-based analysis method with the scheduler-luck game technique of [7].



Acknowledgements. We are most grateful to Joffroy Beauquier and anony- 
Appendix A: Proof of Theorem 10

Before proving this theorem, it is convenient to introduce the notion of
“$\triangleleft$-decreasing sequence” and some related properties.

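Definition 17. Given $x \in X$, a sequence $\sigma = (i_1, \ldots, i_\ell)$ of process indices is
said to be $\triangleleft$-decreasing for $x$ iff:

$x = x_0 \xrightarrow{i_1} x_1 \xrightarrow{i_2} \cdots \xrightarrow{i_\ell} x_\ell$ for some $x_1, \ldots, x_\ell \in X$ such that

- $x_\ell \triangleleft x_{\ell-1} \triangleleft \cdots \triangleleft x_1 \triangleleft x_0$, or

- $x_i \triangleleft x_{i-1} \triangleleft \cdots \triangleleft x_1 \triangleleft x_0$ and $\varphi(x_i) \le c$, for some $i$ ($0 \le i \le \ell$).

For any integer $\ell$, the finite set of $\triangleleft$-decreasing sequences for $x$ of length $\ell$ is
written $Dec_\ell(x)$.

We will write henceforth $x \xrightarrow{\sigma} \mathcal{L}$ with $\sigma = (i_1, \ldots, i_\ell)$ as an abbreviation of:

$\exists x_1, \ldots, x_\ell \quad x \xrightarrow{i_1} x_1 \land x_1 \xrightarrow{i_2} x_2 \land \cdots \land x_{\ell-1} \xrightarrow{i_\ell} x_\ell \in \mathcal{L}$.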

Lemma 18. Relation $\triangleleft$ is an ordering over $X = Q^N$. Furthermore, there exists
$M > 0$ such that every sequence of configurations decreasing for $\triangleleft$ is of length
$< M$.

Proof. First, it is easy to see that $\triangleleft$ is an ordering because $<$ and $\ll$ are
orderings. Let us now show that, for $m \ge M = |X|$, any sequence $x_0, \ldots, x_m$
of elements in $X$ cannot be ordered by $\triangleleft$. By the pigeonhole principle, there
exist $i, j$ with $0 \le i < j \le m$ such that $x_i = x_j$. Hence, we cannot have
$x_m \triangleleft \cdots \triangleleft x_j \triangleleft \cdots \triangleleft x_i \triangleleft \cdots \triangleleft x_0$. □



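Lemma 19. For $M$ defined as above, we have: $\forall x \in X\ \forall \sigma \in Dec_M(x)\ x \xrightarrow{\sigma} \mathcal{L}$.

Proof. By definition, given $x \in X$, for all $\sigma = (i_1, \ldots, i_M) \in Dec_M(x)$, there
exist $x_1, \ldots, x_M$ such that $x = x_0 \xrightarrow{i_1} x_1 \cdots \xrightarrow{i_M} x_M$ with

$x_M \triangleleft x_{M-1} \triangleleft \cdots \triangleleft x_1 \triangleleft x_0$, or

$x_i \triangleleft x_{i-1} \triangleleft \cdots \triangleleft x_1 \triangleleft x_0$ and $\varphi(x_i) \le c$, for some $i$ ($0 \le i \le M$).

But the first case $x_M \triangleleft \cdots \triangleleft x_1 \triangleleft x_0$ is impossible according to Lemma 18.
So: $\varphi(x_i) \le c$, for some $i$ ($0 \le i \le M$). Hence $x \xrightarrow{\sigma} \mathcal{L}$. □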

Proof of Theorem 10

Given $x \in X$ and $\sigma$ such that $x \xrightarrow{\sigma} \mathcal{L}$, let $Pr(x \xrightarrow{\sigma} \mathcal{L})$ be the probability
associated with $x \xrightarrow{\sigma} \mathcal{L}$. Let us now show

$P_1$: $\exists p \in\, ]0, 1]\ \forall x \in X\ \forall \sigma \in Dec_M(x)\ Pr(x \xrightarrow{\sigma} \mathcal{L}) \ge p$.

By iteratively applying Prop’, it is easy to show that, for all $x \in X$ and all $\ell \in \mathbb{N}^*$,
there exists a $\triangleleft$-decreasing sequence $\sigma \in Dec_\ell(x)$. For all $x \in X$, we know by
applying Lemma 19 to $x$ that, for all $\sigma \in Dec_M(x)$, $Pr(x \xrightarrow{\sigma} \mathcal{L}) > 0$. Since
$Dec_M(x)$ is nonempty and finite, we can define $p_x$ as the minimum of $Pr(x \xrightarrow{\sigma} \mathcal{L})$
when $\sigma$ ranges over $Dec_M(x)$. We have: $\forall \sigma \in Dec_M(x)\ Pr(x \xrightarrow{\sigma} \mathcal{L}) \ge p_x > 0$.
Now by taking the minimum $p$ of all the $p_x$ when $x$ ranges over $X$, we get:
$\forall x \in X, \forall \sigma \in Dec_M(x)\ Pr(x \xrightarrow{\sigma} \mathcal{L}) \ge p > 0$. Hence $P_1$.

Observe that, by Prop’, for every configuration $x$ and every scheduler $A$,
there is a path in the computation tree $T(A, x)$ along which the probabilistic
actions make the successive configurations decrease for $\triangleleft$. So from $P_1$ it follows
that the probability, starting from $x$, to reach $\mathcal{L}$ under $A$ in $M$ transitions is
$\ge p$. This writes: $Pr(x \to_A^{M} \mathcal{L}) \ge p$. Alternatively, given a scheduler $A$ and a
starting configuration $x$, the probability of not being in $\mathcal{L}$ is less than $1 - p$ after
the $M$ first transitions. It is less than $(1 - p)^2$ after the $2M$ first transitions, and
so on. The probability of not reaching $\mathcal{L}$ under $A$ after $\ell$ transitions tends to 0
as $\ell$ tends to $\infty$. In other words: $\forall x \in X,\ Pr(x \to_A^* \mathcal{L}) = 1$. So, for any
central scheduler $A$: $\forall x\ Pr(x \to_A^* \mathcal{L}) = 1$. □
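
The limit argument closing the proof can be spelled out as a one-line chain of inequalities. This is only a sketch in notation of our own: $F_n$ is a name introduced here for the event that $\mathcal{L}$ has not been reached during the first $nM$ transitions, and the first inequality uses the uniform bound $p$ of $P_1$ on each block of $M$ transitions together with the closure of $\mathcal{L}$.

```latex
\[
  \Pr(F_n) \;\le\; (1-p)\,\Pr(F_{n-1}) \;\le\; \cdots \;\le\; (1-p)^n
  \xrightarrow[n \to \infty]{} 0,
  \qquad\text{hence}\qquad
  \Pr(x \to_A^{*} \mathcal{L}) \;=\; 1 - \lim_{n \to \infty} \Pr(F_n) \;=\; 1 .
\]
```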



Appendix B: Expected Time of Israeli-Jalfon’s Algorithm 



As mentioned earlier, the scheduler in Israeli-Jalfon’s example is arbitrary and
the algorithm does not behave a priori as a Markov chain. However in the case
where there are only two tokens ($k = 2$), one can suppose that the scheduler is
memoryless, and always selects the same token, say A, since any move of the
other one can be simulated by a symmetrical move on A. It is also easy to show
that the Markov chain corresponding to Israeli-Jalfon’s algorithm is $D$-lumpable
for $k = 2$. The $D$-chain corresponds to a random walk, and the expected time
$E_2(d \to_D^* \bot)$ corresponds to the expected time that one token at a distance $d$
from the origin, animated by a random walk, meets that origin. For $N = 2m + 1$,
we have:

$$Q_2 = \begin{pmatrix}
0 & 1/2 & & & \\
1/2 & 0 & 1/2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1/2 & 0 & 1/2 \\
 & & & 1/2 & 1/2
\end{pmatrix}
\qquad
(I - Q_2)^{-1} = \begin{pmatrix}
2 & 2 & 2 & \cdots & 2 \\
2 & 4 & 4 & \cdots & 4 \\
2 & 4 & 6 & \cdots & 6 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
2 & 4 & 6 & \cdots & 2m
\end{pmatrix}$$

Then $E_2(d \to_D^* \bot) = d(N - d)$ with a maximum of $m(m + 1) \approx (N/2)^2$
when $d = m$. Using our general matrix method, we thus retrieve the quadratic
complexity result found by Israeli and Jalfon [?].
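
Rather than inverting the matrix, one can confirm the closed form by substituting it into the linear system $(I - Q_2)\,t = \mathbf{1}$, which is equivalent to $t = \mathbf{1} + Q_2 t$. The Python sketch below does this for several values of $m$; the function names are ours and the check is an illustration, not a replacement for the derivation.

```python
from fractions import Fraction


def israeli_jalfon_q2(m):
    """m x m matrix Q_2 of the lumped chain for two tokens on a ring of N = 2m + 1."""
    half = Fraction(1, 2)
    q = [[Fraction(0)] * m for _ in range(m)]
    for d in range(m):
        if d > 0:
            q[d][d - 1] = half
        if d < m - 1:
            q[d][d + 1] = half
    q[m - 1][m - 1] = half  # at distance m, moving away keeps the minimal distance at m
    return q


def satisfies_system(q, t):
    """True iff (I - Q) t = 1, i.e. t_d = 1 + sum_e Q[d][e] * t_e for every d."""
    return all(
        t[d] == 1 + sum(q[d][e] * t[e] for e in range(len(t)))
        for d in range(len(t))
    )


def closed_form(m):
    """Candidate solution d * (N - d) for d = 1..m, with N = 2m + 1."""
    n_ring = 2 * m + 1
    return [d * (n_ring - d) for d in range(1, m + 1)]


print(all(satisfies_system(israeli_jalfon_q2(m), closed_form(m)) for m in range(1, 30)))
print(max(closed_form(4)))
```

Since $I - Q_2$ is invertible, a vector that satisfies the system is the unique solution, so the substitution check is sufficient for each tested size.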
Abstract. This paper studies the average hop count measure for virtual
path layouts of ATM and optical networks. Routing in the ATM
and optical network models is based on covering the network with sim- 
ple virtual paths, under some constraints on the allowed load (i.e., the 
number of paths that can share an edge). The hop count is the number 
of edges along the virtual path. 

Two basic results are established concerning the average hop count pa- 
rameter. The first concerns comparing the maximum and average hop 
count measures assuming uniform all-to-all communication requirements. 
We develop a rather general connection between the two measures for 
virtual path layouts with bounded maximum load. This connection al- 
lows us to extend known lower bounds on the maximum hop count into 
ones on the average hop count for network families satisfying certain 
conditions, termed non-condensingly contractable (NCC) graph families. 
Using this characterization, we establish tight lower bounds on the aver- 
age hop count of virtual path layouts with bounded maximum load for 
paths, cycles, and trees. 

Our second result is an algorithm for designing a virtual path layout 
of minimum average hop count for a given tree network with general 
(weighted) one-to-all requirements. 



1 Introduction 

This paper concerns the problem of designing efficient virtual path layouts on
optical or ATM networks (see, e.g., [?]). In the ATM model, the routing
and message forwarding tasks on a given network are simplified by predefining
a collection of “expressways,” or virtual paths (VP’s), which are simple paths in
the network, and performing end-to-end communication over routes composed of
a sequence of such VP’s (i.e., using the VP’s as basic segments within complete
routes). An elegant formulation of the problem can be obtained by representing
the VP’s formed on the given communication network $G$ as a virtual graph $H$
over the same set of vertices. Specifically, a VP connecting $v$ and $u$ in $G$ is
represented by an edge connecting $v$ and $u$ in the virtual graph $H$. The pair
$(H, P)$, where $P$ is the collection of VP’s corresponding to the edges of $H$, is
referred to as the virtual path layout (VPL) for the physical graph $G$. Each route
in $G$ can be viewed as a simple path in the virtual graph $H$.

Various formulations of the VPL problem attempt to design a system of 
virtual paths which optimizes some parameters of the system while meeting some 
given communication demands between pairs of nodes and satisfying certain 
prespecified constraints. Usually, the problem is studied against one canonical 
pattern of communication demands, known as all-to-all communication, which 
requires communication between every pair of vertices in the network. 

Research on virtual path layouts has concentrated on optimizing two central
parameters of conflicting nature. The first is the load of a physical edge, defined
as the number of VP’s that share it. The upper bound on the load of an edge
is termed the capacity of the edge. The maximum (respectively, average) load
of a VPL $(H, P)$, denoted $\mathcal{L}_{\max}(H, P)$ (resp., $\mathcal{L}_{avg}(H, P)$), is defined as the
maximum (resp., average) load of the edges in the network. These parameters
determine the size of the VP routing tables, and reflect the traffic load on the
links.

The second parameter of interest concerns the hop count, namely, the number
of VP’s occurring on the routes of the VPL, or equivalently, the number of edges
in the corresponding path on the virtual graph $H$. Expressed in terms of $H$,
the maximum hop count, denoted $\mathcal{H}_{\max}(H)$, can be viewed as the diameter of
$H$, and the average hop count, denoted $\mathcal{H}_{avg}(H)$, can be defined as the average
distance over all vertex pairs in $H$. Equivalently, one may consider the total
number of hop counts over all vertex pairs in $H$, termed the total hop count of
$H$ and denoted $\mathcal{H}_{tot}(H)$. (Clearly, $\mathcal{H}_{tot}(H) = \mathcal{H}_{avg}(H) \cdot n(n - 1)/2$ for every
graph $H$.) These parameters measure the (worst-case or average) efficiency of
setting up the route, and also the overall (worst-case or average) delay incurred
by the route in a model where the processing along VP’s is negligible compared to
the processing at the VP endpoints. See [?] for a discussion of these parameters
and their significance.

A number of studies have tackled the VPL problem. The problem of minimizing
$\mathcal{H}_{\max}(H)$, the diameter of a virtual graph $H$, subject to
a specified upper bound on the maximum load of the VPL, $\mathcal{L}_{\max}(H,P) \le c$, has
been considered in the undirected case. Conversely, the
problem of minimizing the maximum load $\mathcal{L}_{\max}(H,P)$ over all VPL’s $(H,P)$
with bounded maximum hop count, $\mathcal{H}_{\max}(H) \le h$, has also been studied, as has
the problem of additionally minimizing the average load $\mathcal{L}_{avg}(H,P)$.

As links based on optical fibers are directed, and may have a different load 
in the two directions, it may be useful to consider a directed model as in [3 
E], rather than an undirected one. The problem of minimizing the diameter 
$\mathcal{H}_{\max}(H)$ of a virtual directed graph (digraph) $H$ over bounded maximum load
VPL’s has been considered before, giving lower and upper bounds on the virtual diameter
$\mathcal{H}_{\max}(H)$ of a directed VPL (henceforth, DVPL) with a prespecified capacity $c$
bounding the maximum load (considered as constant).

This paper focuses on the average hop count measure $\mathcal{H}_{avg}$ of a digraph, and
explores the variant of the VPL problem which seeks to optimize this measure.
(Actually, for convenience, we formulate our results in terms of the equivalent
total hop count measure, $\mathcal{H}_{tot}(H)$.) Clearly, the known upper bounds on the
virtual diameter can be converted into ones for the average hop count
$\mathcal{H}_{avg}(H)$ of DVPL’s as well. However, the average distance may admit better
bounds, and hence it is not a priori clear how to obtain tight or near-tight lower
bounds for $\mathcal{H}_{avg}(H)$ over DVPL’s with bounded maximum load. In fact, to the
best of our knowledge, no such nontrivial lower bounds were known so far.

Our first contribution concerns establishing such lower bounds. In Section 3
we develop a rather general and fundamental connection between the average
hop count measure $\mathcal{H}_{avg}(H)$ and the maximum hop count measure $\mathcal{H}_{\max}(H)$
for DVPL’s $(H,P)$ with bounded maximum load. This connection allows us to
extend known lower bounds on $\mathcal{H}_{\max}(H)$ into ones on $\mathcal{H}_{avg}(H)$ for network
families satisfying certain conditions, termed non-condensingly contractable (NCC)
graph families. Using this characterization, we establish tight lower bounds on
the average hop count (or equivalently on the total hop count) of DVPL’s with
bounded maximum load for a number of network families, including paths, cycles,
and trees.

So far, we defined the VPL problem considering an all-to-all pattern of
communication demands. Another pattern of communication demands for which the
VPL problem was studied is the one-to-all pattern, in which a single vertex must
communicate with all other vertices. For the chain network, a duality between
the problem of minimizing the hop count knowing the maximum load and
the one of minimizing the load knowing the maximum hop count has been
established. A number of VPL optimization problems with one-to-all communication
demands have been studied for the chain network, including the problem
of designing a VPL with optimal average hop count under the one-to-all
communication pattern, for which a dynamic programming algorithm was presented.
The “Open problems” section of that study states the following:

“The most immediate open problem is to generalize these results for 
arbitrary trees, a task which seems non-trivial, as far as the dynamic 
programming algorithms are concerned, due to the additional structural 
information that is attached to each subtree (which does not exist in 
chains).” 

Our second contribution, presented in Section 4, involves solving this open
problem. As in the chain case, our solution handles the more general weighted
version of the problem, in which we are given a requirements vector $\omega$ such
that for $1 \le i \le n$, $\omega_i$ specifies the expected amount of traffic between
$v_1$ and $v_i$. Now the weighted average hop count of $G$ is defined by weighing the
distances according to $\omega$. We also extend the solution in another way: instead
of assuming a uniform constant capacity $c$ on all links, we allow a somewhat more
general capacity function $c : E \to \mathbb{N}^+$, under the restriction that
$c(e) \le c_0$ for every link $e$, for some constant $c_0$.

2 Preliminaries 

A physical communication network is represented by an n-vertex strongly con- 
nected directed graph G = (V,E). The vertex set V represents the network 
switches, and the arc set E represents the set of physical directed links. 

For two vertices $v,u \in V$ in $G = (V,E)$, the distance of $u$ from $v$ in $G$,
denoted by $d_G(v,u)$, is the length of the shortest path from $v$ to $u$ in $G$. The
diameter (or maximum hop count) of a graph $G$ is the largest distance achieved
by a pair of its nodes, i.e.,

$$\mathcal{H}_{\max}(G) = \max_{(v,u) \in V \times V} \{d_G(v,u)\}.$$

The total hop count of $G$ is defined as

$$\mathcal{H}_{tot}(G) = \sum_{u,v \in V} d_G(u,v).$$

Our main interest is in strongly-connected graphs, where the total hop count is 
always finite. 

For a family of graphs $\mathcal{M}$, define $\mathcal{H}_{\max}(\mathcal{M}) = \max_{G \in \mathcal{M}}\{\mathcal{H}_{\max}(G)\}$ and
$\mathcal{H}_{tot}(\mathcal{M}) = \max_{G \in \mathcal{M}}\{\mathcal{H}_{tot}(G)\}$. Let us now give two simple and general upper
and lower bounds on $\mathcal{H}_{tot}(G)$ for an arbitrary digraph $G$. (Throughout, most
proofs are omitted from this extended abstract.)

Lemma 1. For every $n$-vertex digraph $G$, $n^2 - n \le \mathcal{H}_{tot}(G) \le (n^3 - n^2)/2$.

We note that the upper bound is attained by the unidirectional cycle graph, where
the diameter is $n - 1$, and each node is an end node of a diameter path.
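
The two measures defined above are easy to evaluate on small examples. The following is a minimal sketch, assuming a digraph is given as a dictionary mapping each vertex to its list of out-neighbors; the function names are illustrative and do not come from the paper. It computes $\mathcal{H}_{\max}$ and $\mathcal{H}_{tot}$ by breadth-first search, summing distances over ordered vertex pairs, and can be used to evaluate the two extreme cases, the complete digraph and the unidirectional cycle.

```python
from collections import deque


def distances_from(adj, source):
    """BFS distances (number of arcs) from `source` in a digraph {node: [out-neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def hop_counts(adj):
    """Return (H_max, H_tot) of a strongly connected digraph.

    H_tot sums d(u, v) over all ordered pairs of vertices.
    """
    n = len(adj)
    h_max, h_tot = 0, 0
    for v in adj:
        dist = distances_from(adj, v)
        if len(dist) != n:
            raise ValueError("graph is not strongly connected")
        h_max = max(h_max, max(dist.values()))
        h_tot += sum(dist.values())
    return h_max, h_tot


def directed_cycle(n):
    """Unidirectional cycle on vertices 0..n-1."""
    return {i: [(i + 1) % n] for i in range(n)}


def complete_digraph(n):
    """Digraph with an arc between every ordered pair of distinct vertices."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

For instance, `hop_counts(directed_cycle(6))` evaluates the cycle case and `hop_counts(complete_digraph(6))` the complete case.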

One question to reckon with, is the connection between the total hop count 
of a graph and its diameter. An easy but useful result is the following. 

Lemma 2. For every $n$-vertex graph $G$, $n \cdot \mathcal{H}_{\max}(G)/2 \le \mathcal{H}_{tot}(G) \le n^2 \cdot \mathcal{H}_{\max}(G)$.

Let us now turn to defining the VPL problem and its relevant parameters.
Given a network $G = (V,E)$, we can assign to certain pairs of distinct vertices
$x,y \in V$ a simple directed path (dipath) $P(x,y)$, connecting $x$ to $y$. Consider a
set $E' \subseteq V \times V$ containing every vertex pair $(x,y)$ for which such a dipath $P(x,y)$
is defined. We consider a new digraph $H = (V,E')$, called a virtualization of $G$.
The path $P(e) = P(x,y)$ in the original graph $G$ associated with the arc $e =
(x,y)$ in $H$ is called a virtual path (VP). Note that $H$ is not necessarily strongly
connected, but we limit our discussion to strongly connected virtualizations,
unless stated otherwise. In our terminology, the pair $(H,P)$ is a directed virtual
path layout (DVPL) on $G$. With each dipath $Q = (e_1, \ldots, e_l)$ in $H$ we associate
a route in $G$ consisting of the concatenation of $P(e_1), \ldots, P(e_l)$.

Given a collection $P$ of virtual paths and an arc $e$, let $T_P[e]$ denote the
collection of all virtual dipaths $P(e')$, for $e' \in H$, that contain the arc $e$, that
is, $T_P[e] = \{e' \in E' \mid e \in P(e')\}$. The load of an arc $e$ of $G$ is the number of
virtual dipaths that contain it, i.e., $l(e) = |T_P[e]|$. The maximum load of a DVPL
$(H,P)$ is denoted by $\mathcal{L}_{\max}(H,P) = \max_{e \in E}\{l(e)\}$. A DVPL $(H,P)$ satisfying
$\mathcal{L}_{\max}(H,P) \le c$ is referred to as a $c$-admissible directed virtual path layout (or a
$c$-DVPL) of $G$.
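
As an illustration of the load definition, here is a minimal sketch, assuming a DVPL is represented as a dictionary from each virtual arc $(x,y)$ to the vertex sequence of its dipath $P(x,y)$ in $G$. This representation and the function names are assumptions made for the example, not part of the paper.

```python
from collections import Counter


def arc_loads(virtual_paths):
    """Load l(e) of every physical arc e.

    virtual_paths maps a virtual arc (x, y) to the list of vertices of the
    simple dipath P(x, y) in G, starting at x and ending at y.
    """
    loads = Counter()
    for path in virtual_paths.values():
        # Consecutive vertices on the dipath are the physical arcs it uses.
        for tail, head in zip(path, path[1:]):
            loads[(tail, head)] += 1
    return loads


def max_load(virtual_paths):
    """Maximum load over all physical arcs (0 if there are no virtual paths)."""
    return max(arc_loads(virtual_paths).values(), default=0)


def is_c_admissible(virtual_paths, c):
    """True if the layout is a c-DVPL, i.e. its maximum load is at most c."""
    return max_load(virtual_paths) <= c


# Five-vertex unidirectional cycle with virtual arcs (v_i, v_{i+2 mod 5}),
# each routed along the two physical arcs between its endpoints.
n = 5
layout = {
    (i, (i + 2) % n): [i, (i + 1) % n, (i + 2) % n] for i in range(n)
}
```

In this layout every physical arc is shared by exactly two virtual paths, so the layout is 2-admissible but not 1-admissible.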

Given a graph $G$ and a positive integer $c$, we denote by $Virt(G,c)$ the set of all
$c$-admissible virtualizations of $G$. We now define the minimal realizable diameter
of $(G,c)$ as the minimal diameter that can be achieved by a virtualization from
$Virt(G,c)$, i.e.,

$$\mathcal{H}_{\max}(G,c) = \min\{\mathcal{H}_{\max}(H) \mid H \in Virt(G,c)\}.$$

Similarly, define the minimal realizable total hop count of $(G,c)$ as the minimal
total hop count that can be achieved by such a virtualization, i.e.,

$$\mathcal{H}_{tot}(G,c) = \min\{\mathcal{H}_{tot}(H) \mid H \in Virt(G,c)\}.$$

3 The Total Hop Count of Virtual Graphs 

In this section we establish upper and lower bounds on $\mathcal{H}_{tot}(G_n,c)$ in certain
graph families, including paths, cycles, and trees.

One direction is easy; by a direct application of Lemma 2 we can upper
bound the total hop count of any graph family in terms of its diameter, as
follows.

Lemma 3. For every $n \in \mathbb{N}$ and $n$-vertex graph $G_n$,

$$\mathcal{H}_{tot}(G_n,c) \le n^2 \cdot \mathcal{H}_{\max}(G_n,c).$$

Our main goal in this section is to derive lower bounds matching the upper
bound of the last lemma for various graph classes. This is achieved by developing
a general connection between the average hop count measure $\mathcal{H}_{avg}(H)$ and
the maximum hop count measure $\mathcal{H}_{\max}(H)$ for DVPL’s $(H,P)$ with bounded
maximum load, for network families satisfying certain conditions, termed
non-condensingly contractable (NCC) graph families.

Let us first introduce the following definitions. Let $G = (V,E)$ be a (directed
or undirected) graph, and let $(v,u) \in E$ be an edge of $G$. The contraction of
$v$ to $u$ is the operation of deleting $v$ from $G$, and reattaching all its edges to
$u$, namely, replacing every edge $(v,w)$ with the edge $(u,w)$. (In the case of a
directed graph, replace each arc $(v,w)$ (respectively $(w,v)$) with the arc $(u,w)$
(respectively $(w,u)$).) Resulting multiple edges and self loops are removed from
the graph. (See Figure 1.)

Fig. 1. The contraction of $v$ to $u$.

Let $G = (V,E)$ be a (directed or undirected) connected graph, and let $V'$ be
a subset of $V$. A contraction of $G$ to $V'$ is a function $\varphi : V \to V'$ having the
following properties:

(C1) $\varphi(v') = v'$, for every $v' \in V'$.

(C2) For every $v' \in V'$, the subgraph induced by $\varphi^{-1}(v')$ is connected.

Fig. 2. Contracting $G$ to $V' = \{v_1, v_2, v_3\}$ according to $\varphi$, where $\varphi^{-1}(v_i)$ is the set
enclosed by the dashed ellipse around $v_i$, for $i = 1, 2, 3$. Vertices of $V'$ are darkened.

For a contraction $\varphi$ of $G$ to $V'$, define the graph $\varphi(G) = (V',E')$, where

$$E' = \{(v',u') \mid v',u' \in V', \exists v \in \varphi^{-1}(v'), \exists u \in \varphi^{-1}(u') \text{ s.t. } (v,u) \in E\}.$$

A contraction can be implemented by the following iterative process. Let $k = 0$
and $V_0 = V'$. While $V \setminus V_k$ is not empty, select (by some fixed rule) a node
$u \in V \setminus V_k$ adjacent to some node $v \in V_k$, contract $v$ to $u$ (or $u$ to $v$, according to
the direction of the arc, if $G$ is directed), and let $V_{k+1} = V_k \cup \{u\}$, and $k = k+1$.
Note that the contraction process (and hence its outcome) is not unique, and it
is determined by the specific selection rule used. The resulting contraction $\varphi$ is
defined for this process as $\varphi(v) = v$ for every $v \in V'$, and $\varphi(u) = v$ for every
$u \notin V'$ such that $v$ is contracted to $u$ (or $u$ is contracted to $v$) during the process,
as, for some $k$, $v \in V_k$ and $u \in V \setminus V_k$.

Now we consider the induced operation of contraction on virtual graphs. Let
$G = (V,E)$ be a connected graph, and let $H$ be a virtualization of $G$. Let $V'$
be a subset of $V$, and let $\varphi$ be a contraction of $G$ to $V'$. The virtualization of
$H$ induced by $\varphi$, denoted as $\mathcal{V}(\varphi,H) = H_\varphi$, is the virtualization of $\varphi(G)$ where
there is a virtual edge from $v'$ to $u'$, for $v',u' \in V'$, iff there is a virtual edge in
$H$ from some node in $\varphi^{-1}(v')$ to some node in $\varphi^{-1}(u')$.

The following example shows that $H_\varphi$ is not necessarily a legal contraction
of $H$ (even if $H$ is strongly connected).

Example 1. Let $G$ be the five-vertex clockwise unidirectional cycle graph, with $V =
\{v_0, v_1, v_2, v_3, v_4\}$ ordered clockwise (see Figure 3(a)). Let $H$ be a virtualization
of $G$, with virtual arcs $(v_i, v_{i+2 \pmod 5})$, for $0 \le i \le 4$. (See Figure 3(b).)

Taking $V' = \{v_0, v_3\}$, and $\varphi(v_0) = \varphi(v_1) = v_0$, $\varphi(v_2) = \varphi(v_3) = \varphi(v_4) =
v_3$, we get a legal contraction of $G$ to $V'$ (see Figure 3(c)), but the induced
virtualization $H_\varphi$ is not a legal contraction of $H$ to $V'$ (the subgraph of $H$
induced by $\varphi^{-1}(v_0) = \{v_0, v_1\}$ is not connected (see Figure 3(d))). □
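
Example 1 can be checked mechanically. The sketch below assumes that graphs are given as sets of arcs, that a contraction is a dictionary `phi` from vertices to their images, and that "connected" in property (C2) means connected when arc directions are ignored; all three are assumptions made for illustration, and the helper names are hypothetical.

```python
def contract(arcs, phi):
    """Arc set of phi(G): images of all arcs, without self loops or duplicates."""
    return {(phi[v], phi[u]) for v, u in arcs if phi[v] != phi[u]}


def induces_connected(arcs, nodes):
    """True if `nodes` induce a connected subgraph, ignoring arc directions."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in arcs:
            # Follow each arc in both directions, staying inside `nodes`.
            for x, y in ((a, b), (b, a)):
                if x == v and y in nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes


def is_legal_contraction(arcs, phi):
    """Check properties (C1) and (C2) for the mapping phi."""
    targets = set(phi.values())
    # (C1): every vertex of V' is mapped to itself.
    if any(phi[t] != t for t in targets):
        return False
    # (C2): the preimage of every vertex of V' induces a connected subgraph.
    return all(
        induces_connected(arcs, [v for v in phi if phi[v] == t])
        for t in targets
    )


n = 5
G = {(i, (i + 1) % n) for i in range(n)}   # unidirectional 5-cycle
H = {(i, (i + 2) % n) for i in range(n)}   # virtual arcs (v_i, v_{i+2 mod 5})
phi = {0: 0, 1: 0, 2: 3, 3: 3, 4: 3}       # V' = {v_0, v_3}
```

Under these assumptions, `is_legal_contraction(G, phi)` holds while `is_legal_contraction(H, phi)` does not, since $H$ has no arc between $v_0$ and $v_1$; both `contract(G, phi)` and `contract(H, phi)` consist of the two arcs between $v_0$ and $v_3$.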




Fig. 3. (a) The graph $G$. (b) The virtual graph $H$. (c) $\varphi(G)$, the contraction of $G$ to
$V' = \{v_0, v_3\}$. (d) The induced virtual graph $H_\varphi$.



We are interested in a specific type of contractions, which preserve the legality
of a $c$-DVPL on a given graph. For this we need to introduce some definitions.
First, we generalize the definition of an induced virtualization by a contraction
to an induced DVPL. Let $G = (V,E)$ be a graph, and let $(H,P)$ be a DVPL of
$G$. Let $V'$ be a subset of $V$, and let $\varphi$ be a contraction of $G$ to $V'$. The DVPL of
$(H,P)$ induced by $\varphi$, denoted as $\mathcal{V}(\varphi,H,P) = (H_\varphi,P_\varphi)$, is the DVPL of $\varphi(G)$,
defined as follows: $H_\varphi$ is the virtualization of $H$ induced by $\varphi$ (as defined above);
each arc of $H_\varphi$, $e' = (v',u')$, is defined by a virtual arc of $H$, $e = (v,u)$, with
$v \in \varphi^{-1}(v')$ and $u \in \varphi^{-1}(u')$. If the dipath $P(e)$ in $G$ associated with $e \in H$ is
$v_1, v_2, \ldots, v_k$, then the dipath $P_\varphi(e')$ in $\varphi(G)$ associated with $e' \in H_\varphi$ is defined
to be $\varphi(v_1), \varphi(v_2), \ldots, \varphi(v_k)$, omitting all self loops. $P_\varphi$ is taken to be the set
of all such dipaths, $P_\varphi = \{P_\varphi(e') \mid e' \in H_\varphi\}$. (E.g., in the situation described
in Figure 3(b), there is a virtual arc $(v_4,v_1)$, whose associated dipath in $G$ is
$(v_4,v_0,v_1)$; the resulting arc (Figure 3(d)) is $(v_3,v_0)$, whose associated dipath in
$\varphi(G)$ is $(\varphi(v_4), \varphi(v_0), \varphi(v_1))$, which is, omitting self loops, $(v_3,v_0)$. Here $P_\varphi(e')$
identifies with $e'$.)

Next, we define two (different) arcs in a directed graph $G = (V,E)$, $e_1 =
(v_1,u_1)$ and $e_2 = (v_2,u_2)$, as parallel if, considering $G$ as undirected (i.e., ignoring
the directions of $E$), there is a path between $v_1$ and $v_2$ that does not pass through
$u_1$ or $u_2$, and there is a path between $u_1$ and $u_2$ that does not pass through $v_1$
or $v_2$. The path connecting $v_1$ and $v_2$ is allowed to be empty, in case $v_1 = v_2$,
and analogously for $u_1$ and $u_2$. (See Figure 4.)

A contraction $\varphi$ of $G$ to $V'$ is called condensing if there exist two parallel arcs
in $G$, $(v_1,u_1)$ and $(v_2,u_2)$, such that $\varphi(v_1) = \varphi(v_2) \ne \varphi(u_1) = \varphi(u_2)$. Otherwise,
$\varphi$ is said to be non-condensing.

Fig. 4. Parallel arc pairs are $(x,u)$ and $(x,y)$, $(x,u)$ and $(y,v)$, $(x,u)$ and $(v,u)$.



Lemma 4. Let $(H,P)$ be a $c$-DVPL of the graph $G = (V,E)$. Let $V'$ be a subset
of $V$, and let $\varphi$ be a non-condensing contraction of $G$ to $V'$. Then $(H_\varphi,P_\varphi)$ is a
$c$-DVPL of $\varphi(G)$.

Now we define a special kind of graph families: Let $\mathcal{M} = \{G_n\}$ be a
family of $n$-vertex directed graphs. The family $\mathcal{M}$ is called non-condensingly
contractable (NCC) if there exists a family of $n$-vertex directed graphs $\mathcal{M}' =
\{F_n\}$, functions $g(n,c)$ and $f(n,c)$ and an integer constant $c \ge 1$ such that the
following properties hold:

(P1) $\mathcal{H}_{\max}(G_n,c) \le g(n,c)$.

(P2) $\mathcal{H}_{\max}(F_n,c) \ge f(n,c)$.

(P3) For every constant $0 < \alpha < 1$ and every $c \in \mathbb{N}$, there is a function $p(\alpha,c)$
s.t. $f(\alpha n,c) \ge p(\alpha,c) \cdot g(n,c)$ for every $n$.

(P4) For every graph $G_n$ and for every subset $V'$ of $V$, there exists a
non-condensing contraction $\varphi$ of $G_n$ to $V'$ such that $\varphi(G_n) \in \mathcal{M}'$.

Intuitively, an NCC graph family is a family whose every graph has “many”
pairs of nodes whose mutual distance is “close to” the graph’s diameter, or,
in other words, the diameter is “approximately” achieved by “many” pairs of
nodes.

For such graph families, we can state and prove our main theorem, which is
useful for deriving lower bounds on the total hop count of some graph families.

Theorem 1. Let $\{G_n\}$ be an NCC graph family. Then $\mathcal{H}_{tot}(G_n,c) = \Theta(n^2 \cdot
\mathcal{H}_{\max}(G_n,c))$.

Proof. Consider an NCC graph family $\mathcal{M} = \{G_n\}$. By Lemma 3, $\mathcal{H}_{tot}(G_n,c) =
O(n^2 \cdot \mathcal{H}_{\max}(G_n,c))$, so it remains to prove the opposite direction, i.e., to show
that the virtualization $H_n$ that achieves $\mathcal{H}_{tot}(H_n) = \mathcal{H}_{tot}(G_n,c)$ satisfies

$$\mathcal{H}_{tot}(H_n) = \Omega(n^2 \cdot \mathcal{H}_{\max}(G_n,c)).$$

Let $c_1, c_2$ be constants such that $0 < c_2 < c_1 < \frac{1}{2}$. Define $\alpha = \frac{c_1 - c_2}{1 - c_2}$ and for
each $c$ define $\hat{c} = \min\{1, p(\alpha,c)/3\}$, where $p(\alpha,c)$ is the function specified for $G_n$ in
property (P3).

Let $H_n$ be the virtualization of $G_n$ attaining $\mathcal{H}_{tot}(H_n) = \mathcal{H}_{tot}(G_n,c)$. Let
$\delta = \hat{c} \cdot \mathcal{H}_{\max}(G_n,c)$. Let $x$ denote the number of pairs of nodes $(u,v)$ in $V$ whose
mutual distance in $H_n$ satisfies $d_{H_n}(u,v) \le \delta$. If $x < c_1 \cdot n^2$, then $H_n$ includes
at least $(\frac{1}{2} - c_1) \cdot n^2$ pairs of nodes $(u,v)$ such that $d_{H_n}(u,v) > \delta$, hence

$$\mathcal{H}_{tot}(H_n) > \left(\tfrac{1}{2} - c_1\right) \cdot n^2 \cdot \hat{c} \cdot \mathcal{H}_{\max}(G_n,c) = \Omega(n^2 \cdot \mathcal{H}_{\max}(G_n,c)),$$

and we are done. So hereafter suppose $x \ge c_1 \cdot n^2$.

For each node $v \in H_n$ define the $\delta$-neighborhood of $v$ in $H_n$ as

$$N(v) = \{u \in H_n \mid d_{H_n}(u,v) \le \delta\}.$$

We call a node $u$ close to $v$ if $u \in N(v)$, and call a node $v \in H_n$ a congestion
point if $|N(v)| \ge c_2 n$.

Denote the number of congestion points in $H_n$ by $m$. For a congestion point
$v$, an upper bound on $|N(v)|$ is $n$, and for a non-congestion point $v$, $|N(v)| < c_2 n$.
So, noting that $x \le \sum_{v} |N(v)|$, we upper bound $x$ by

$$x \le mn + c_2 n(n - m).$$

As by assumption $x$ is no smaller than $c_1 n^2$, we get that $mn + c_2 n(n - m) \ge c_1 n^2$,
which yields

$$m \ge \frac{c_1 - c_2}{1 - c_2} \cdot n = \alpha n.$$

So let us take $V' \subseteq V$ as an arbitrary set of $\alpha n$ congestion points. Now we
non-condensingly contract $G_n$ to $V'$, denoting by $F_{\alpha n}$ the resulting graph, and
by $H_{\alpha n}$ the result of $H_n$ under the contraction. Since the contraction is
non-condensing, $H_{\alpha n}$ is a $c$-admissible virtualization of $F_{\alpha n}$ by Lemma 4. We also
have

$$\mathcal{H}_{\max}(H_{\alpha n}) \ge f(\alpha n,c) \ge p(\alpha,c) \cdot g(n,c) \ge p(\alpha,c) \cdot \mathcal{H}_{\max}(G_n,c).$$

So in $H_{\alpha n}$ there is a pair of nodes $(u',v')$ whose distance in $H_{\alpha n}$ is at least
$p(\alpha,c) \cdot \mathcal{H}_{\max}(G_n,c)$. Their distance in $H_n$ is definitely no smaller.
So we have two congestion point nodes $(u',v')$ in $H_n$ whose distance is at least
$d_{H_n}(u',v') \ge p(\alpha,c) \cdot \mathcal{H}_{\max}(G_n,c)$. By definition of $\hat{c}$ we get $d_{H_n}(u',v') \ge 3\hat{c} \cdot
\mathcal{H}_{\max}(G_n,c)$. It follows that $H_n$ contains at least $c_2^2 \cdot n^2$ pairs of nodes, namely,
the nodes in $N(u') \times N(v')$, whose mutual distance is at least $\hat{c} \cdot \mathcal{H}_{\max}(G_n,c)$.
This implies that

$$\mathcal{H}_{tot}(G_n,c) = \mathcal{H}_{tot}(H_n) \ge c_2^2 \cdot n^2 \cdot \hat{c} \cdot \mathcal{H}_{\max}(G_n,c) = \Omega(n^2 \cdot \mathcal{H}_{\max}(G_n,c)).$$ □

By establishing that paths, cycles and trees are NCC families, we get

Corollary 1. 1. For the $n$-vertex path $P_n$, $\mathcal{H}_{tot}(P_n,c) = \Theta(n^2 \cdot \mathcal{H}_{\max}(P_n,c))$.

2. For the $n$-vertex cycle $C_n$, $\mathcal{H}_{tot}(C_n,c) = \Theta(n^2 \cdot \mathcal{H}_{\max}(C_n,c))$.

3. For any $n$-vertex tree $T_n$, $\mathcal{H}_{tot}(T_n,2) = \Theta(n^2 \cdot \mathcal{H}_{\max}(T_n,2))$.

Proof. We have to show that each of the above graph families is NCC. In each
of the three cases, we take $F_n = G_n$. For the $n$-vertex path $P_n$, as by 0

1 

n2c-i 

2 — ^ 4c 

we can take f{n,c) = g(n,c) = and p{a,c) = 

(2q,)1/(2c-1)/(8^) 

, so properties (P1)-(P3) hold. As for the last property, (P4), an 
appropriate contraction of the path is achieved by mapping each non-congestion 
point that has some congestion points on its left to the closest such point, and 
mapping each non-congestion point that has no congestion point to its left to 
the nearest congestion point on its right. 

For the n- vertex cycle C„, as by [S| 

1 1 

ri2c - /n\2S 

2 — ^maxy^ni^) ^ \2 ) ’’ 

we can take /(n,c) = g{n,c) = and p{a,c) = 

satisfying properties (P1)-(P3). An appropriate contraction of 
the cycle satisfying property (P4) is mapping each non-congestion point to its 
closest congestion point in the clockwise direction. 

For the $n$-vertex tree $T_n$, as by the known bounds

$$\frac{n^{1/3}}{2} \le \mathcal{H}_{\max}(T_n,2) \le 32 \cdot n^{1/3},$$

we take $f(n,2) = n^{1/3}/2$, $g(n,2) = 32 \cdot n^{1/3}$ and $p(\alpha,2) = \alpha^{1/3}/64$, satisfying
properties (P1)-(P3), and for property (P4), we contract the tree by taking some
congestion point as the root of the tree and mapping each non-congestion point
to the closest congestion point on the (unique) path connecting it to the root.
In all three cases we have NCC families, so the result follows. □

Finally, combining Corollary 1 with the known bounds for $\mathcal{H}_{\max}$
over paths, cycles and trees (as quoted in the proof of the above corollary), we
get the following.

Corollary 2. 1. For the $n$-vertex path, $P_n$, $\mathcal{H}_{tot}(P_n,c) = \Theta(n^{2+\frac{1}{2c-1}})$.

2. For the $n$-vertex cycle, $C_n$, $\mathcal{H}_{tot}(C_n,c) = \Theta(n^{2+\frac{1}{2c}})$.

3. For an $n$-vertex tree, $T_n$, $\mathcal{H}_{tot}(T_n,2) = \Theta(n^{7/3})$.

4 Arbitrary Root-to-All Communication on Trees 

So far, we concentrated on the DVPL problem over the all-to-all pattern of
communication demands, and gave global (combinatorial) bounds for the total
hop count of admissible (capacity-restricted) virtual graphs. In this section we
turn to algorithmic aspects, and consider the problem of finding a virtual
$c$-admissible graph which minimizes the total hop count, as well as calculating
that minimal total hop count, for a given pair $(G,c)$. This problem seems to
be complicated in general, so following previous work we focus on studying it in a
more restricted setting. Specifically, we give an algorithm for handling the (weighted)
one-to-all version of the DVPL problem on trees.

Formally, we are given an $n$-vertex (directed or undirected) tree $G = (V,E)$
on vertices $v_1, \ldots, v_n$, rooted at $v_1$, with all arcs directed away from the root.
The capacity of each edge is specified by a capacity function $c$. It is assumed
that the capacities are bounded by some constant $c_0 \in \mathbb{N}$. (Alternatively, larger
capacities can be allowed, so long as the number of different choices for link
capacity remains constant; a nonconstant $c_0$ would make the complexity of our
algorithm exponential.) We are also given a requirements vector $\omega$ specifying,
for every $1 \le i \le n$, the expected amount $\omega_i$ of traffic between $v_1$ and $v_i$. We
call such a pair $(G,\omega)$ of a graph and a requirements vector a requirement pair.

For the pair $(G,\omega)$, the $\omega$-total hop count is defined by taking into account
the relevant communication requirement of each pair of vertices, namely,

$$\mathcal{H}^{\omega}_{tot}(G) = \sum_{2 \le i \le n} d_G(v_1,v_i) \cdot \omega_i.$$

Define $\mathcal{H}^{\omega}_{tot}(G,c)$ as the minimum $\omega$-total hop count that can be achieved by
a $c$-admissible virtualization of $G$. We want to calculate $\mathcal{H}^{\omega}_{tot}(G,c)$, and find a
$c$-admissible virtualization of $G$ that achieves it.
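
To make the objective concrete, the following minimal sketch evaluates the $\omega$-total hop count of a given virtualization by breadth-first search from the root. It assumes the virtual graph is a dictionary of out-neighbor lists and the requirements vector is a dictionary from vertices to weights; these representations and the function names are illustrative assumptions. It evaluates a fixed virtualization only and does not search for the optimal one.

```python
from collections import deque


def root_distances(virtual_adj, root):
    """BFS distances (number of virtual arcs) from the root in the virtual graph."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in virtual_adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def weighted_total_hop_count(virtual_adj, root, omega):
    """Sum of d_H(root, v_i) * omega_i over all non-root vertices.

    Returns infinity if a vertex with a nonzero requirement is unreachable.
    """
    dist = root_distances(virtual_adj, root)
    total = 0
    for v, weight in omega.items():
        if v == root or weight == 0:
            continue
        if v not in dist:
            return float("inf")
        total += dist[v] * weight
    return total


# Chain 1 -> 2 -> 3 -> 4 rooted at vertex 1, with requirements omega.
omega = {2: 1, 3: 2, 4: 3}
physical = {1: [2], 2: [3], 3: [4]}   # H = G: every virtual path is one arc
shortcut = {1: [2, 4], 2: [3]}        # adds a virtual arc from 1 to 4
```

On this chain, using the physical arcs as the virtual graph gives a weighted total of 14, while the layout with the extra virtual arc from vertex 1 to vertex 4 gives 8; that layout places at most two virtual paths on any physical arc.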

In order to simplify the presentation, we handle first the special case where
the requirements in $\omega$ are boolean, namely, $\omega_i \in \{0,1\}$ for every $1 \le i \le n$.

For the boolean variant of the problem, the vertices with which the root
needs to communicate are called the required destinations, and are denoted by
$Q(\omega) = \{v_i \mid \omega_i = 1\}$.

It turns out that this restricted variant of the problem can be solved in
polynomial time. For a given tree $G$ and a set $Q(\omega)$ of required destinations,
the best virtualization is found by dynamic programming. For this, we would
like to evaluate the contribution of a subtree to the total hop count. Let the
vertices of $G$ other than the root be $v_2, \ldots, v_n$, organized in breadth-first order,
i.e., satisfying that if $v_i$ is the parent of $v_j$, then $i < j$. Each $v_i$ defines a subtree
of $G$, denoted $T_i$, whose set of vertices $V_i$ includes $v_i$ and all the vertices below
it in the tree. We also denote by $e_i$ the arc entering $v_i$. For some virtualization
$H$ of $G$, and some subtree $T_i$ of $G$, we define the internal total hop count of $T_i$
with respect to $H$ as

$$\mathcal{H}_{tot}(H,T_i) = \sum_{v_j \in V_i \cap Q(\omega)} d_H(v_1,v_j).$$

We also define the minimal required total hop count of $T_i$ as the minimum over
all such virtualizations,

$$\mathcal{H}^{*}_{tot}(T_i) = \min\{\mathcal{H}_{tot}(H,T_i) \mid H \in Virt(G)\}.$$

As usual, the dynamic programming algorithm is based on solving many small
subproblems gradually. A typical subproblem $\Psi(T_i,d)$ to be solved during the
algorithm is defined as follows:

Fig. 5. A typical subproblem to be solved.

Input: A subtree $T_i$ rooted at $v_i$ and a tuple $d = (d_1, \ldots, d_k)$ of $k$ non-negative
integers.

Intuitively, the tuple represents the starting points of $k \ge 0$ virtual arcs
$A = (A_1, \ldots, A_k)$ entering $T_i$. Each arc $A_j$ starts at some node $x_j$ on the path
between $v_1$ and $v_i$. The input component $d_j$ represents the distance in $H$ of its
start vertex $x_j$ from $v_1$, namely $d_j = d_H(v_1,x_j)$. This value is henceforth referred
to as the tail-length of $A_j$. The end vertex of each of the $k$ arcs is some (unknown)
vertex $y_j$ of $T_i$ (see Figure 5). Hence these virtual arcs are not yet completely
specified, as their end-points $y_j$ are to be chosen by the solution devised for the
subproblem $\Psi$. It is convenient to refer to the virtual arcs of $A$ throughout the
following discussion, but the reader should bear in mind that these arcs are only
implicit in the algorithm, and only the $d$ vectors are manipulated explicitly.

Output: A value $f(T_i,d)$, which is a non-negative integer or $\infty$.

For a subtree $T_i$ rooted at $v_i$ and a tuple $Y = (y_1, \ldots, y_k)$ of end vertices of
the virtual arcs, we denote by $H_Y$ the virtualization of $T_i$ including all internal
virtual arcs (i.e., arcs of $H$ downwards from vertices of $T_i$), and the $k$ paths from
$v_1$ to each $y_j$ of lengths $d_j + 1$, for $1 \le j \le k$. Let $Y^*$ denote the optimal tuple
$Y$, such that the internal total hop count $\mathcal{H}_{tot}(H_{Y^*},T_i)$ of $T_i$ is minimal over all
virtualizations including $k$ paths from $v_1$ to some vertices in $T_i$ of lengths $d_j + 1$.
The output of $\Psi(T_i,d)$, denoted by $f(T_i,d)$, is $\mathcal{H}_{tot}(H_{Y^*},T_i)$ if this can be done,
or $\infty$ if the instance $(T_i,d)$ is infeasible.

We calculate $\mathcal{H}^{\omega}_{tot}(G,c)$ (and find the virtualization that achieves it), by solving
all the subproblems $\Psi(T_i,d)$, namely, finding $f(T_i,d)$ for every $T_i$ and every
possible tuple of tail-lengths $d$. This is done according to the dynamic programming
paradigm. We create a table $F$ of $n - 1$ rows and $\sum_{k=0}^{c_0} n^k$ columns. The
columns are grouped into sets $F_0, F_1, \ldots, F_{c_0}$, where $F_k$ consists of $n^k$ columns.
These columns are used to store all possible outputs $f(T_i,d)$. Row $i$ of the table
stands for the subtree $T_i$ rooted at the vertex $v_i$. Note that the rows are ordered
$0, 1, \ldots, n - 1$, so if $v_i$ is a parent of $v_j$, then $T_j$’s row is above $T_i$’s row. The
entries of this set of columns and of the $i$th row stand for the class of subproblems
in which $k$ virtual arcs enter $T_i$.

As $d_H(v_1,x_j) \le n - 1$, for every $x_j \in V$ and every virtualization $H$ including
a path from $v_1$ to $x_j$, the $k$th set has $n^k$ columns, each standing for one tuple of
$d$. The procedure will store $f(T_i,d)$ in the entry of the $i$th row and the column
corresponding to $d$.

Clearly, if $k = 0$, and $T_i$ includes any required destinations, then the subproblem
$\Psi(T_i,d)$ for $d = ()$ is infeasible, so the value for this entry is $\infty$. Note that if
the (physical) arc reaching $v_i$ is $e_i$, then the subproblem $\Psi(T_i,d)$ for $|d| > c(e_i)$
is also infeasible, so the columns in sets $F_k$ for $k > c(e_i)$ (if there are any) are
irrelevant, and should also be filled with $\infty$. Note also that for each $T_i$ there is
no need to use more virtual arcs than the number of required destinations in $T_i$,
but we do take such cases into consideration during the table filling.

Note that if $k > 1$ arcs reach $v_i$, then it is never necessary to terminate more
than one of them at $v_i$ itself. Hence we can extend some (perhaps all) of them to
reach some internal vertices of $T_i$. The way we do it affects the needed internal
total hop count.

The algorithm fills the table in a dynamic programming manner, from the
bottom up, filling each row by using the already filled ones below it. In the
full paper we present the algorithm in detail, prove that it indeed calculates
$\mathcal{H}^{\omega}_{tot}(G,c)$, and bound its complexity, yielding the following theorem.

Theorem 2. The algorithm described above calculates $\mathcal{H}^{\omega}_{tot}(G,c)$, and finds the
virtualization that achieves it in polynomial time.

The algorithm as presented assumes that the requirements vector $\omega$ is
boolean. We now describe how this algorithm can be extended to the general
case.

The notation and the algorithm are a natural generalization of those of the
previous section, with some minor changes. Having a general communication
requirements vector $\omega$ for the tree graph $G$, for some virtualization $H$ of $G$, and
some subtree $T_i$ the internal total hop count of $T_i$ with respect to $H$ is defined
with respect to $\omega$:

$$\mathcal{H}^{\omega}_{tot}(H,T_i) = \sum_{v_j \in V_i} d_H(v_1,v_j) \cdot \omega_j,$$

and the definition of $\mathcal{H}^{*}_{tot}(T_i)$ is modified accordingly. The dynamic programming
calculations and the table filling are done as in the previous section, only filling
leaf values with respect to $\omega$:

$$f(T_i,d) = \begin{cases}
0, & k = 0 \text{ and } \omega_i = 0, \\
\infty, & k = 0 \text{ and } \omega_i \ne 0, \\
\min_{1 \le j \le k}\{(d_j + 1) \cdot \omega_i\}, & 0 < k \le c(e_i), \\
\infty, & k > c(e_i),
\end{cases}$$

and also taking the communication requirements into consideration when calculating
the function $g$:

$$g(T_i,d,X) = (d_0 + 1) \cdot \omega_i + \sum_{l=1} f(T_{i_l}, d^{l}(X)).$$

The correctness proof and complexity analysis of the algorithm remain the same,
and so does its modification, which enables us to find the virtualization that
achieves it. Therefore, we have the following.

Theorem 3. The algorithm described above calculates $\mathcal{H}^{\omega}_{tot}(G,c)$ and finds the
virtualization that achieves it in polynomial time.

Abstract. We investigate the notion of Long Range Contact graphs.
Roughly speaking, such a graph is defined by (1) an underlying network
topology $G$, and (2) one (or possibly more) extra link connecting every
node $u$ to a “long distance” neighbor, called the long range contact of $u$.
This extra link represents the a priori knowledge that a node has about
far nodes and is set up randomly according to some probability
distribution $p$. To illustrate the claim that Long Range Contact graphs are
a good model for the small world phenomenon, we study greedy routing
in these graphs. Greedy routing is the distributed routing protocol in
which a node $u$ makes use of its long range contact to progress toward
a target, if this contact is closer to the target than the other neighbors.
We give upper and lower bounds on greedy routing on the $n$-node ring
$C_n$ augmented with links chosen using the $r$-harmonic distributions. In
particular, we show a tight $\Theta(\log^2 n)$-bound for the expected number of
steps required for routing in $C_n$ augmented using the 1-harmonic
distribution. Hence, our study shows that the model of Kleinberg can
be simplified by using the ring rather than the mesh while preserving
the main features of the model. Our study also demonstrates the
significant difference (in terms of both diameter and routing) between the ring
augmented with long range contacts chosen with the harmonic
distribution and the ring augmented with a random matching as introduced
by Bollobás and Chung. Finally, using epimorphisms of a graph onto
another, for any network $G$, we show how to define a probability
distribution $p$ and study the performance of greedy routing in $G$ augmented with
$p$. For appropriate embeddings (if they exist), this performance turns out
to be $O(\log^2 n)$.

1 Introduction

The small-world phenomenon arises from rather anecdotal experience that has
been witnessed in many large interconnected systems: it is a phenomenon that
formalizes the paradoxical ability of an entity in the system to be only a few
“degrees” of separation away from any other entity in the system. This paradoxical
occurrence of the small-world phenomenon has been backed by statistical
data of reachability and has several instantiations in the scientific literature from
sociology to the web. It has become the subject of investigation in popular as
well as artistic culture.

To understand this phenomenon, studies have been made that include the
introduction of two graph theoretic models: relational graphs and spatial graphs.
In relational graphs the probability of the vertices becoming connected depends
only upon preexisting connections. In spatial graphs, the corresponding
probability is a function of the vertices. In recent years the web has
been the focus of investigations. Here researchers have investigated power-laws,
i.e., the probability that a node has degree $k$ is given by $k^{-c}$, for some constant
$c > 0$; this implies that nodes with low degree are the most numerous and the
probability of nodes with given degree $k$ decreases as $k$ increases, proportionately
with $k^{-c}$. All these studies show that random graphs $\mathcal{G}_{n,p}$, as defined by
Erdős and Rényi, are not good models for the small world phenomenon, because
they have a large diameter when the average degree is small [2].

In this paper, we study the notion of Long Range Contact graphs. Let G =
(V,E) be a network on n vertices. Consider a probabilistic mapping p on the
vertices of G such that $\sum_{v \in V} p(u,v) = 1$ for all $u \in V$. I.e., each node $u \in V$
has an associated probability distribution $p(u,\cdot)$. Given G and p, the Long Range
Contact graph (G,p) is a directed graph defined on the same set of vertices, such
that every node u has $\deg_G(u) + 1$ out-neighbors, that is, its $\deg_G(u)$ neighbors
in G, plus one additional out-neighbor chosen at random according to p. This
latter neighbor is called the long range contact of u. The probabilistic mapping
p, i.e., the probability distributions $p(u,\cdot)$, reflects the “vague knowledge” available
at the nodes about the possible status and location of desired information
located at some node of the network.

In small world graphs, not only are the nodes a few degrees of separation apart,
but they are also able (or expected) to find reasonably short routes between
them. Therefore, the following two parameters have been the source of much
research: (1) the diameter of (G,p), i.e., the maximum distance between any
two nodes in the augmented graph; and (2) the performance of greedy routing
in (G,p), i.e., routing from a source s to a target t is executed by selecting, at
each intermediate node u, the next node as the neighbor of u (including its long
range contact) which is closest (in the graph G) to the target t.

These two parameters depend first on the probability distribution used to select a
long range contact and second on the underlying topology of the graph. To be a
good candidate to abstract the small world phenomenon, a graph model must ensure
that both the diameter and the number of greedy routing steps are small. In
this paper, we study the model in which G is the ring $C_n$, and p is the harmonic
distribution.

Table 1. Expected number of steps of greedy routing in the ring augmented with long
range contacts chosen according to the r-harmonic distribution.

  r            Lower bound                              Upper bound
  0 ≤ r < 1    $\Omega(n^{(1-r)/(2-r)})$  (Theorem 4)   $O(n^{1-r})$  (Theorem 2)
  r = 1        $\Omega(\log^2 n)$  (Theorem 5)          $O(\log^2 n)$  (Theorem 1)
  1 < r < 2    $\Omega(n^{(r-1)/r})$  (Theorem 4)       $O(n^{r-1})$  (Theorem 1)
  r = 2        $\Omega(\sqrt{n})$  (Theorem 4)          $O(n \log\log n / \log n)$  (Theorem 3)
  2 < r        $\Omega(n^{(r-1)/r})$  (Theorem 4)       $O(n)$  (Trivial)



Related research. Among the previously cited papers, two are strongly
connected to this paper. Bollobás and Chung have studied the diameter of a ring
plus a random matching, selected uniformly among all possible matchings. They
have shown that the resulting augmented ring has diameter $\Theta(\log n)$ with
probability tending to 1 as n goes to infinity. However, the performance of greedy
routing can be very bad in this model. Indeed, Kleinberg has shown that the
ring augmented with long range contacts chosen uniformly at random offers very
bad properties in terms of routing ($\Omega(\sqrt{n})$ lower bound for the expected
number of steps). As an attempt to model the small world phenomenon, Kleinberg
has therefore proposed to use the 2-dimensional square grid augmented with
long range contacts chosen according to the 2-harmonic distribution. He showed
that, in this model, greedy routing performs in $O(\log^2 n)$ expected number of
steps. Moreover he showed that this is optimal in the sense that for $r \ne 2$ any
distributed routing algorithm based on the r-harmonic distribution has a
polynomial (in n) lower bound on the expected number of steps. He concluded that
the grid with the 2-harmonic distribution is a good model for the small world
phenomenon.



Results of the paper. Motivated by the research of Bollobás and Chung, we
have investigated the augmented ring. Motivated by the research of Kleinberg,
we have investigated r-harmonic mappings $p_r$, $r \ge 0$, defined as follows. Given
two nodes u and v, the probability for u to have v as long range contact is given
by

    $p_r(u,v) = \frac{d(u,v)^{-r}}{\sum_{w \ne u} d(u,w)^{-r}}$,

where $d(\cdot,\cdot)$ is the distance function in the network.
The uniform distribution (which is obtained for r = 0), i.e., p(i,j) = 1/n, and
the Zipf distribution (which is obtained for r = 1 − log 0.80/log 0.20), are two
examples of harmonic distributions. We have performed an exhaustive study of
the performance of greedy routing in the ring augmented with harmonic long
range contacts, for all $r \ge 0$. Table 1 summarizes our results.
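
As an illustration of the r-harmonic mapping just defined, the following minimal Python sketch builds $p_r(u,\cdot)$ on an undirected ring. The function names `ring_distance` and `r_harmonic_mapping`, and the choice of the undirected n-node ring as the network, are assumptions made for this example only; they do not come from the paper.

```python
def ring_distance(u, v, n):
    """Shortest-path distance between nodes u and v on the undirected n-node ring."""
    d = abs(u - v) % n
    return min(d, n - d)


def r_harmonic_mapping(u, n, r):
    """Return {v: p_r(u, v)}, where p_r(u, v) is proportional to d(u, v)**(-r)."""
    # Unnormalized weights d(u, v)^(-r) for every candidate contact v != u.
    weights = {v: ring_distance(u, v, n) ** (-r) for v in range(n) if v != u}
    total = sum(weights.values())  # normalizing constant
    return {v: w / total for v, w in weights.items()}


if __name__ == "__main__":
    p = r_harmonic_mapping(0, 10, 1.0)
    for v in sorted(p):
        print(v, round(p[v], 4))
```

With r = 0 every other node receives the same probability, and increasing r shifts probability mass toward nearby nodes.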
One important result in this table is the tight $\Theta(\log^2 n)$ bound for the
expected number of steps of greedy routing in the ring augmented with long range
contacts chosen using the 1-harmonic distribution. The upper bound $O(\log^2 n)$
shows that the simple ring can perform as well as the square mesh, and hence
provides a simpler model for the small world phenomenon. The lower bound
$\Omega(\log^2 n)$, as well as the other lower bounds for $r \ne 1$, show that greedy routing
cannot perform faster than $\log^2 n$ steps in any ring augmented with a harmonic
distribution. It seems to be a challenging task to prove or disprove the existence
of a distribution allowing greedy routing to perform faster in the ring, the square
grid, or even the k-dimensional mesh, $k \ge 3$.

As a last contribution, we show how to extend the results of the ring to
any network G, by using epimorphisms of a graph onto another. In particular,
we show how to define a probabilistic mapping p and study the performance of
greedy routing in (G,p). For appropriate embeddings this performance turns out
to be $O(\log^2 n)$.

2 Preliminary Results 

For the purpose of simplifying the presentation, all our results are formally
proven for the directed ring, i.e., the digraph in which nodes are labeled from 0
to n, and where node i has node i + 1 as out-neighbor, and i − 1 as in-neighbor
(unless specified otherwise, all operations are performed modulo n + 1). In each
case, the result in the undirected ring differs by a constant factor only. We denote
by $R_{n+1}$ the directed ring of n + 1 nodes.

The r-harmonic random variable $H_r$, with values in {1, ..., n}, has the
probability distribution defined by $\Pr(\{H_r = k\}) = k^{-r}/H_n^{(r)}$, where
$H_n^{(r)} = \sum_{i=1}^{n} i^{-r}$ is the r-harmonic number of order n. Therefore, if
$R_{n+1}$ is augmented using the r-harmonic mapping $p_r$, then, given two nodes i and j,
the probability for i to have j as long range contact in $(R_{n+1},p_r)$ is given by
$p_r(i,j) = d(i,j)^{-r}/H_n^{(r)}$, where $d(i,j) = (j - i) \bmod (n+1)$.

This formula can be made more explicit by noticing that the harmonic numbers
satisfy the following identities.

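Lemma 1. The r-harmonic number of order n is

    $H_n^{(r)} = \begin{cases} \frac{n^{1-r}}{1-r} + O(1) & \text{if } r < 1; \\ \log n + O(1) & \text{if } r = 1; \\ \Theta(1) & \text{if } r > 1. \end{cases}$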
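
The three regimes of Lemma 1 can be checked numerically. The sketch below is only an illustration: the helper name `harmonic_number` is an assumption of this example, and the comparison values are the leading terms stated in the lemma (with the natural logarithm for r = 1).

```python
import math


def harmonic_number(n, r):
    """r-harmonic number of order n: the sum of k**(-r) for k = 1..n."""
    return sum(k ** (-r) for k in range(1, n + 1))


if __name__ == "__main__":
    n = 10_000
    # r < 1: the leading term is n^(1-r) / (1-r), up to an additive constant.
    print(harmonic_number(n, 0.5), n ** 0.5 / 0.5)
    # r = 1: the leading term is log n, up to an additive constant.
    print(harmonic_number(n, 1), math.log(n))
    # r > 1: the sum stays bounded as n grows.
    print(harmonic_number(n, 2))
```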

The next lemma shows thresholds in the behavior of the harmonic distributions.
Not surprisingly, these thresholds are those appearing in Table 1.

Lemma 2. The expected value of $H_r$ is

    $E(H_r) = \begin{cases} \Theta(n) & \text{if } 0 \le r < 1 \\ \Theta(n/\log n) & \text{if } r = 1 \\ \Theta(n^{2-r}) & \text{if } 1 < r < 2 \\ \Theta(\log n) & \text{if } r = 2 \\ \Theta(1) & \text{if } 2 < r. \end{cases}$



For our analysis of greedy routing in $R_{n+1}$, we will always assume that the
source node is 0, and the target node is n. It is easy to observe that this
is a worst case as far as greedy routing is concerned. Indeed, the probability
for a node to have a long range contact at distance d on the ring decreases as
d increases. Therefore the farther a source is from a target, the larger the
expected number of steps to route from that source to that target.

A very naive interpretation of Lemma 2 would be to derive that, e.g., greedy
routing in the ring augmented with the 1-harmonic distribution performs in
$O(\log n)$ expected number of steps. This reasoning fails because the expected
gain of using long range contacts decreases as one gets closer to the destination
(as long range contacts may lead farther away from that destination than one
currently is). The following clarifies that point. Given a node $s \in \{0, \ldots, n-1\}$,
greedy routing defines a random variable $J_s$ as the length of the “jump”
performed at s toward the target n. It satisfies: $J_s = H_r$ if $H_r \le n - s$, and
$J_s = 1$ otherwise. One can easily show the following.



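Lemma 3. For $k \le n - s$, we have

    $\Pr(\{J_s = k\}) = \begin{cases} \Pr(\{H_r = k\}) + \Pr(\{H_r > n - s\}) & \text{if } k = 1 \\ \Pr(\{H_r = k\}) & \text{if } 1 < k \le n - s. \end{cases}$

And the expected value of the jump $J_s$ at node s in $(R_{n+1},p_r)$ is:

    $E(J_s) = \begin{cases} \Theta((n-s)^{2-r}/n^{1-r}) & \text{if } r < 1 \\ \Theta((n-s)/\log n) & \text{if } r = 1 \\ \Theta((n-s)^{2-r}) & \text{if } 1 < r < 2 \\ \Theta(\log(n-s)) & \text{if } r = 2 \\ \Theta(1) & \text{if } r > 2. \end{cases}$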
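
The jump process behind Lemma 3 is easy to simulate. The following Python sketch assumes the setting described above (directed ring $R_{n+1}$, source 0, target n, jump $J_s = H_r$ if $H_r \le n - s$ and 1 otherwise); the function names `greedy_route_steps` and `mean_steps` are illustrative assumptions, and the code is a sanity-check tool rather than part of the analysis.

```python
import random
from itertools import accumulate


def greedy_route_steps(n, r, rng):
    """Count greedy-routing steps from node 0 to node n on the directed ring."""
    # Cumulative (unnormalized) weights of the r-harmonic variable on {1, ..., n}.
    cum_weights = list(accumulate(k ** (-r) for k in range(1, n + 1)))
    lengths = range(1, n + 1)
    s, steps = 0, 0
    while s < n:
        # Each visited node has a fresh, independent long range contact.
        h = rng.choices(lengths, cum_weights=cum_weights, k=1)[0]
        # Use the contact only if it does not overshoot the target.
        s += h if h <= n - s else 1
        steps += 1
    return steps


def mean_steps(n, r, trials=200, seed=0):
    """Average number of steps over independent trials (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(greedy_route_steps(n, r, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for r in (0.0, 1.0, 2.0, 3.0):
        print(r, mean_steps(1024, r))
```

Running it for several values of r gives an empirical feel for the regimes listed in Table 1, although small n only hints at the asymptotic behavior.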



3 Upper Bounds 

We begin with general considerations which apply to arbitrary networks. Then
we will refine these concepts for the specific case of the ring. For each vertex
u of G = (V,E), and each real number r > 0, define the ball $B_r^G(u)$ of radius
r around u as the set of vertices at distance at most r from u. (If the graph
used is clear from the context we will omit the superscript G from $B_r^G(u)$ and
write $B_r(u)$.) For any set S of vertices of (G,p) and any vertex $u \in V$ define
$p[u \to S] = \sum_{v \in S} p(u,v)$. This quantifies the weight that a node
u gives to a contact in S, in the sense that $p[u \to S]$ is the probability that
node u has a long-range contact in the set S.
Definition 1. Let G be a graph, p a probabilistic mapping on G, c > 1 a
constant, and f a function. The pair (G,p) is called an (f,c)-Long Range Contact
graph if for any pair (u,t) of vertices of G at distance at most d we have that
$p[u \to B_{d/c}(t)] \ge 1/f(d)$.

Lemma 4. Let G = (V,E) be a graph of diameter D. If (G,p) is an
(f,c)-Long Range Contact graph then greedy routing in (G,p) performs in
$\sum_{i=0}^{\log_c D} f(D/c^i)$ expected number of steps.

Proof. What is the probability, for a given node u at distance at most d from
the target t, that the long range contact selected is at a distance at most d/c
from the target? By definition, this is equal to $p[u \to B_{d/c}(t)]$. Moreover, by
the geometric distribution, the expected number of trials to guarantee success is
$1/p[u \to B_{d/c}(t)]$. When a trial fails, we make a move towards the target by going
to a neighbor along a shortest path from the current node to the target. The next
trial is therefore performed at a node still at distance at most d from t. It follows
from Definition 1 that the expected number of trials to get a contact in $B_{d/c}(t)$
is at most $1/p[u \to B_{d/c}(t)] \le f(d)$. This implies that after at most f(d) expected
number of routing steps from u, we enter $B_{d/c}(t)$. Iterating this we conclude
that the expected number of steps for routing is at most $\sum_{i=0}^{\log_c D} f(D/c^i)$.
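
The bound of Lemma 4 is a sum of f over geometrically shrinking distances D, D/c, D/c², and so on. The sketch below evaluates such a sum for a caller-supplied f; the name `lemma4_bound` and the stopping rule (stop once the distance drops below 1) are assumptions of this example.

```python
import math


def lemma4_bound(f, D, c):
    """Sum f(D / c**i) over i = 0, 1, ... while the distance D / c**i is at least 1."""
    total, d = 0.0, float(D)
    while d >= 1:
        total += f(d)
        d /= c
    return total


if __name__ == "__main__":
    n = 2 ** 20
    # If f(d) is about log n for every d, the sum is about (log_c D) * log n,
    # which is the shape of the log^2 n bound obtained for the ring with r = 1.
    print(lemma4_bound(lambda d: math.log(n), D=n, c=2))
```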

Using specific probabilistic mappings we can simplify our analysis. 

Definition 2. A probabilistic mapping p on a graph G is distance-invariant if
p(u,v) depends only on the distance d(u,v). A distance-invariant mapping is
called non-increasing if it is a non-increasing function of the distance.

To simplify notation we use the same symbol to denote the resulting mapping,
namely p(u,v) = p(d(u,v)). We can prove the following result.

Lemma 5. If p is a non-increasing distance-invariant mapping on the graph G
then for all vertices u, t with $d(u,t) \le d$ and all constants c > 0, we have that

    $p[u \to B_{d/c}(t)] \ge p((c+1)d/c) \cdot |B_{d/c}(t)|$.

Proof. Let v be a node in $B_{d/c}(t)$. For any such node v, $d(u,v) \le d(u,t) + d(t,v) \le
d + d/c = (c+1)d/c$. It follows that

    $p[u \to B_{d/c}(t)] = \sum_{v \in B_{d/c}(t)} p(u,v)$
        $= \sum_{v \in B_{d/c}(t)} p(d(u,v))$        since p is distance invariant
        $\ge \sum_{v \in B_{d/c}(t)} p((c+1)d/c)$    since p is non increasing
        $= p((c+1)d/c) \cdot |B_{d/c}(t)|$,

which completes the proof of the lemma.
As a direct consequence of Lemma 5 and by definition of (f,c)-Long Range
Contact graphs, we obtain the following result.

Lemma 6. Consider a graph G and a non-increasing distance-invariant mapping
p. Then, for any c > 1, the pair (G,p) is an (f,c)-Long Range Contact
graph where the function f(d) is defined by $f(d) = 1/\bigl(p((c+1)d/c) \cdot |B_{d/c}(t)|\bigr)$.



Theorem 1. The expected number of steps for greedy routing on $R_{n+1}$ is

    $O(\log^2 n)$ if r = 1;
    $O(n^{r-1})$ if 1 < r < 2.

Proof. The r-harmonic mapping $p_r$ on a graph G is a non-increasing
distance-invariant mapping. From Lemma 6, $(G,p_1)$ is a Long Range Contact graph
with $f(d) \sim \log n$. The $O(\log^2 n)$ bound then results from the application
of Lemma 4. Similarly, for r > 1, $(G,p_r)$ is a Long Range Contact graph with
$f(d) \sim d^{r-1}$ (cf. Lemma 1). The result then follows by application of Lemma 4.

In order to obtain non-trivial upper bounds when either r < 1 or $r \ge 2$ we can
use the method of probabilistic recurrences. First we recall the following
discussion from [14] (Theorem 1.3, page 15). Let g(x) be a monotone non-decreasing
function from positive reals to positive reals. Consider a particle starting from
position 0 and moving along the discrete line segment from 0 to n and whose
position changes in discrete time intervals. If the particle is currently at position
s it moves to position s + X where X is a random variable ranging over the
integers 1, ..., n − s such that $E[X] \ge g(n - s)$. The following result is due to
Karp, Upfal and Wigderson (see also [2] for additional
information on probabilistic recurrences):

Lemma 7. (Karp, Upfal, Wigderson) Let T be the random variable
denoting the number of steps in which the particle reaches the position n. Then
$E(T) \le \int_1^n dx/g(x)$.
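
To see what the integral bound of Lemma 7 gives for a concrete g, the sketch below approximates $\int_1^n dx/g(x)$ with the midpoint rule. The function name `kuw_bound`, the number of subintervals, and the example choice $g(x) = x^{2-r}/n^{1-r}$ (the order of the expected jump for r < 1 in Lemma 3, with the constant set to 1) are assumptions of this illustration.

```python
def kuw_bound(g, n, steps=100_000):
    """Midpoint-rule approximation of the integral of 1/g(x) over [1, n]."""
    h = (n - 1) / steps
    return sum(h / g(1 + (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    n, r = 100, 0.5
    # Expected jump of order x^(2-r) / n^(1-r) when the remaining distance is x.
    g = lambda x: x ** (2 - r) / n ** (1 - r)
    numeric = kuw_bound(g, n)
    # Closed form of the same integral: n^(1-r) * (1 - n^(r-1)) / (1 - r).
    closed = n ** (1 - r) * (1 - n ** (r - 1)) / (1 - r)
    print(numeric, closed)
```

The closed form shows that the integral is $O(n^{1-r})$ for fixed r < 1.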

We can use Lemma 7 to analyze greedy routing when r < 1. More precisely,
we can prove the following result.

Theorem 2. The expected number of steps for greedy routing on $R_{n+1}$ using
r-harmonic distributions with $0 \le r < 1$ is $O(n^{1-r})$.

Proof. Greedy routing is similar to the motion of the particle described above.
By Lemma 3, if the particle is in position s then the expected length of a jump is
$\Theta((n-s)^{2-r}/n^{1-r})$. If we let $g(x) = \Theta(x^{2-r}/n^{1-r})$ then Lemma 7 is applicable
and we obtain that the expected number of steps of greedy routing is at most
$\int_1^n \frac{n^{1-r}}{x^{2-r}}\,dx = O(n^{1-r})$, up to a constant factor.

The Lemma on probabilistic recurrences can also be used for analysing greedy
routing when using 2-harmonic distributions.

Theorem 3. The expected number of steps for greedy routing on $R_{n+1}$ using
the 2-harmonic distribution is $O\bigl(\frac{n \log\log n}{\log n}\bigr)$.

Proof. Combining Lemmas 3 and 7 we can show that up to a constant the
expected number of steps of greedy routing is at most $\int_1^n dx/g(x)$ with
$g(x) = \Theta(\log x)$. This is easily seen to be in $O\bigl(\frac{n \log\log n}{\log n}\bigr)$.



4 Lower Bounds 

The proof of the following result is based on a proof in HH. 

Lemma 8. Let p be any distance-invariant mapping on $R_{n+1}$. Assume that
there exist d and D, and $\varepsilon$, $0 < \varepsilon < 1$, such that one of the two
following conditions holds:

1. $d \ge D$ and $D \cdot \sum_{i=1}^{d} p(i) \le \varepsilon$;

2. $d \cdot D < n$ and $D \cdot \sum_{i \ge d} p(i) \le \varepsilon$.

Then the expected number of steps of greedy routing is at least $(1 - \varepsilon)D$.

Proof. First we prove the lemma under condition 1. Let B denote the ball of
$R_{n+1}$ centered at n and radius d, i.e., B = {n − d, ..., n − 1, n}. Recall that we
consider greedy routing from 0 to n. Consider the events:

• $E$: In at most D steps we reach n.

• $E'$: In at most D steps we reach a node that has a long range contact to a node
in B.

• $E'_i$: In step i we reach a node that has a long range contact to a node in B.

Let X be the random variable which counts the number of steps to reach n
from 0. In view of condition 1 we have that

    $\Pr(E') = \Pr(\bigcup_{i=1}^{D} E'_i) \le \sum_{i=1}^{D} \Pr(E'_i) \le D \cdot p[0 \to B] \le D \cdot \sum_{i=1}^{d} p(i) \le \varepsilon$.

It follows that

    $\Pr(\overline{E'}) = 1 - \Pr(E') \ge 1 - \varepsilon$.    (1)

Since $d \ge D$, $E \subseteq E'$, and hence $\overline{E'} \subseteq \overline{E}$. It follows that $\Pr(E \mid \overline{E'}) = 0$. Using
this and Inequality (1) we can show that

    $E[X] = \sum_k k \cdot \Pr(\{X = k\}) \ge \sum_k k \cdot \Pr(\{X = k\} \cap \overline{E'}) = \Pr(\overline{E'}) \cdot E[X \mid \overline{E'}] \ge (1 - \varepsilon)D$.
This proves the first part of the lemma. Next we prove the lemma under
condition 2. Consider the events

• $E$: In at most D steps we reach n.

• $E'$: In at most D steps, we reach a node $u_0$ that has a long range contact to a
node $u_0^+ \ne n$ such that $d(u_0, u_0^+) \ge d$.

• $E'_i$: In step i, we reach a node $u_0$ that has a long range contact to a node $u_0^+ \ne n$
such that $d(u_0, u_0^+) \ge d$.

Again, let X be the random variable which counts the number of steps to
reach n from 0. For every node u, let $u^+$ be the long range contact of u. Using
Condition 2 of the lemma, we obtain

    $\Pr(E') = \Pr(\bigcup_{i=1}^{D} E'_i) \le \sum_{i=1}^{D} \Pr(E'_i) \le D \cdot \Pr(\{d(u,u^+) \ge d\}) = D \cdot \sum_{i \ge d} p(i) \le \varepsilon$.

Since $d \cdot D < n$, $E \subseteq E'$, and hence $\overline{E'} \subseteq \overline{E}$. Therefore,
$E[X] \ge \Pr(\overline{E'}) \cdot E[X \mid \overline{E'}] \ge (1 - \varepsilon)D$. This completes the proof of the lemma.



Theorem 4. The expected number of steps for greedy routing on $R_{n+1}$ under the
r-harmonic distribution is bounded from below by (up to a constant):

    $n^{(1-r)/(2-r)}$ if r < 1;
    $n^{(r-1)/r}$ if 1 < r.

Proof. The cumulative distributions of the r-harmonic random variable $H_r$ are
given (up to a multiplicative constant) by the formulas

    $\Pr(\{H_r \le k\}) = (k/n)^{1-r}$ if r < 1;
    $\Pr(\{H_r \le k\}) = 1 - k^{1-r}$ if r > 1.

When r < 1 we apply condition 1 of Lemma 8 with $d = D = n^{(1-r)/(2-r)}$ and
$\varepsilon = 1/2$. When r > 1 we apply condition 2 of Lemma 8 with $d = n^{1/r}$,
$D = n^{(r-1)/r}$ and $\varepsilon = 1/2$.



In the specific case r = 1, one can prove the optimality of Theorem 1 for the
1-harmonic distribution.

Theorem 5. The expected number of steps of greedy routing using the
1-harmonic distribution is at least $\Omega(\log^2 n)$.



Proof. Let H be a 1-harmonic random variable in {1, ..., n}, i.e.,
$\Pr(\{H = i\}) = 1/(i \cdot H_n)$ where $H_n = \sum_{i=1}^{n} 1/i = \Theta(\log n)$. For any s,
$0 \le s \le n - 1$, the jump at node s is

    $J_s = H$ if $H \le n - s$, and $J_s = 1$ otherwise.

Greedy routing from 0 to n constructs a sequence $s_0 = 0, s_1, s_2, \ldots$ such that
$s_{i+1} = s_i + J_{s_i}$. From Lemma 3 we have, for any $k \in \{1, \ldots, n - s\}$,

    $\Pr(\{J_s = k\}) = \begin{cases} \Pr(\{H = k\}) & \text{if } 1 < k \le n - s; \\ \Pr(\{H = 1\}) + \Pr(\{H > n - s\}) & \text{otherwise,} \end{cases}$    (2)

and

    $E(J_s) = \Pr(\{H > n - s\}) + \frac{n - s}{H_n} \le 1 + \frac{n - s}{H_n}$.    (3)

For $0 \le i \le \lfloor \log_2 n \rfloor$, let $n_i = n \cdot (1 - 1/2^i)$ and $I_i = [n_i, n_{i+1})$. Let $i > 0$,
$s \in I_{i-1}$, and $E_s$ be the event that the long range contact of s is in $[n_{i+1}, n]$
(i.e., greedy routing from s to n “jumps” over $I_i$). We have
$\Pr(E_s) = \Pr(\{J_s \ge n_{i+1} - s\})$, and thus, thanks to Equation (2),

    $\Pr(E_s) = \sum_{k=n_{i+1}-s}^{n-s} \Pr(\{H = k\}) \approx \frac{1}{H_n} \log\frac{n-s}{n_{i+1}-s} = \frac{1}{H_n} \log\Bigl(1 + \frac{n - n_{i+1}}{n_{i+1} - s}\Bigr)$.

For $s \in I_{i-1}$, we have $2 - \log 3 \le \log\bigl(1 + \frac{n - n_{i+1}}{n_{i+1} - s}\bigr) \le 1$. As a consequence,

    $\frac{2 - \log 3}{H_n} \le \Pr(E_s) \le \frac{1}{H_n}$.    (4)

Let K be the random variable defined as the number of consecutive first intervals
containing at least one node $s_j$, while performing greedy routing from 0 to n.
More precisely, if greedy routing constructs the sequence $s_0 = 0, s_1, s_2, \ldots$, then
$K = \min\{i : s_j \notin I_i, \forall j\} - 1$. From Equation (4), and by using
$\ln(1 + x) \approx x$ when x is small, easy calculations show that $E(K) =
\Omega(\log n)$. Let us now concentrate on the time it takes to traverse an interval $I_i$,
$i \le K$. Let



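    $t_i = \min\{s_j : s_j \in I_i\}$ and $t'_i = \max\{s_j : s_j \in I_i\}$.

Then let $\Delta_i = t_i - n_i$ and $\Delta'_i = n_{i+1} - t'_i$. If $t_i = s_j$ and $s_{j-1} \in I_\ell$, then
$\Delta_i \le J_{n_\ell} = J_{n(1 - 1/2^\ell)}$, thus, thanks to Equation (3),

    $E(\Delta_i) \le E(J_{n(1 - 1/2^\ell)}) \le 1 + \frac{n}{2^\ell H_n}$.

Similarly,

    $E(\Delta'_i) \le E(J_{n(1 - 1/2^i)}) \le 1 + \frac{n}{2^i H_n}$.

Therefore, if $i \le K$, we get $\ell = i - 1$, and thus

    $E(\Delta_i) \le 1 + \frac{n}{2^{i-1} H_n}$ and $E(\Delta'_i) \le 1 + \frac{n}{2^i H_n}$.    (5)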
Let $D_i = t'_i - t_i$. We have $D_i = (n_{i+1} - n_i) - (\Delta_i + \Delta'_i)$, and thus from Equation (5),

    $E(D_i) \ge |I_i| (1 - 6/H_n)$.    (6)

In the interval $I_i$, the long range contacts are at distance at most $J_{n_i}$.
Let $X^{(1)}, X^{(2)}, \ldots$ be independent copies of $J_{n_i}$, and let $N_i$ be the stopping
time for reaching distance $D_i$, that is

    $N_i = \min\{k \mid \sum_{j=1}^{k} X^{(j)} \ge D_i\}$.

From Equation (6) we have $E(\sum_{j=1}^{N_i} X^{(j)}) \ge E(D_i) \ge |I_i|(1 - 6/H_n)$. On the other
hand, by Wald’s Equation (see (Corollary 6.2.3)), we have $E(\sum_{j=1}^{N_i} X^{(j)}) =
E(N_i) \cdot E(X^{(j)})$. Therefore, from Equation (3) we get

    $E(N_i) \ge \frac{|I_i|(1 - 6/H_n)}{1 + 2|I_i|/H_n} = \Omega(\log n)$.

To summarize, the expected number of consecutive intervals $I_i$ traversed by the
greedy routing is $\Omega(\log n)$, and the expected number of steps to traverse each
of these intervals is $\Omega(\log n)$. Therefore the expected number of steps of greedy
routing is at least $\Omega(\log^2 n)$.



It is an open problem whether or not the lower bound of Theorem 5 is valid
under any distance invariant distribution on the ring $R_{n+1}$. However we note
the following general result which is an immediate corollary of Lemma 8.

Corollary 1. Let p be any non-increasing distance-invariant mapping on
$R_{n+1}$ and D < n/4 an integer such that

    $\min\Bigl\{ \sum_{i=1}^{O(D)} p(i),\ \sum_{i=\Omega(n/D)}^{n} p(i) \Bigr\} \le O(1/D)$.

Then the expected number of steps of greedy routing is in $\Omega(D)$.




5 Long Range Contact Graphs 

In this section, we show how to generalize the results obtained on the ring to
arbitrary graphs. More precisely, we consider the issue of how to produce an
appropriate probabilistic mapping p on an arbitrary graph G so that routing
can be done in a small number of steps in (G,p). We begin with the class of
k-dimensional tori.

Kleinberg considers the two dimensional grid. We can generalize his
result in the following manner. Consider the k-dimensional torus $T_q^k$ with $n = q^k$
vertices, i.e., q vertices per dimension and $k \ge 1$. It is clear that balls of radius
d have size $\Theta(d^k)$ and spheres of radius d have size $\Theta(d^{k-1})$. Moreover the
diameter is $D = \Theta(n^{1/k})$. Let us consider the r-harmonic distribution on the
graph $T_q^k$. For the r-harmonic distribution we have

    $p(d) = \frac{d^{-r}}{\sum_{i=1}^{D} \Theta(i^{k-1}) \, i^{-r}} = \Theta\Bigl(\frac{d^{-r}}{\sum_{i=1}^{D} i^{k-1-r}}\Bigr)$.    (7)

Equation (7) indicates that we should select r = k. In this case we obtain that
$p(d) = \Theta(d^{-k}/\log q)$. In particular, using Lemma 6, $(T_q^k,p)$ becomes an (f,c)-
Long Range Contact graph, where

    $f(d) = 1/\bigl(p(3d/2) \cdot |B_{d/2}(t)|\bigr)$
         $= 1/\bigl(\Theta((3d/2)^{-k}/\log q) \cdot (d/2)^{k}\bigr)$
         $= \Theta(3^k \log q)$.




Since the diameter of the graph is $D = \Theta(n^{1/k})$ we can use Lemma 4 to
obtain the following result.

Lemma 9. Let $T_q^k$ be the k-dimensional torus of dimension $k \ge 1$ and $n = q^k$
nodes, and let $p_k$ be the k-harmonic mapping. Then $(T_q^k,p_k)$ is an (f,2)-Long
Range Contact graph, where $f(d) = O(3^k \log n)$. Moreover, greedy routing in
$(T_q^k,p_k)$ performs in $O(3^k \log^2 n)$ expected number of steps.

It follows from Lemma 9 that greedy routing can be performed in $O(\log^2 n)$
expected number of steps in the k-dimensional torus $T_q^k$, where k is constant
and the probabilistic mapping is defined as before. Let us now present a tool to
extend results on greedy routing in a graph G to other graphs G'. First, we
recall the notion of an epimorphism.



Definition 3. Consider two graphs G = (V,E) and G' = (V',E'). An
epimorphism of G onto G' is an onto mapping $\phi : V \to V'$ such that $\{u,v\} \in E \Rightarrow
\{\phi(u), \phi(v)\} \in E'$, for all vertices $u,v \in V$.
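
Definition 3 can be checked mechanically on small graphs. In the sketch below, the function name `is_epimorphism`, the representation of graphs as vertex and edge lists, and the example (the 6-node ring mapped onto the 3-node ring by taking labels modulo 3) are all assumptions chosen for illustration.

```python
def is_epimorphism(vertices_g, edges_g, vertices_h, edges_h, phi):
    """Check that phi maps G onto H and sends every edge of G to an edge of H."""
    # Onto: every vertex of H is the image of some vertex of G.
    onto = {phi[u] for u in vertices_g} == set(vertices_h)
    # Edge preservation: {u, v} in E implies {phi(u), phi(v)} in E'.
    edge_set_h = {frozenset(e) for e in edges_h}
    preserves = all(frozenset((phi[u], phi[v])) in edge_set_h for u, v in edges_g)
    return onto and preserves


if __name__ == "__main__":
    ring6 = [(i, (i + 1) % 6) for i in range(6)]
    ring3 = [(i, (i + 1) % 3) for i in range(3)]
    phi = {i: i % 3 for i in range(6)}
    print(is_epimorphism(range(6), ring6, range(3), ring3, phi))
```

A mapping that collapses the two endpoints of an edge onto a single vertex is rejected, since a one-element set is never an edge of a simple graph.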

Note that, if $\phi$ is an epimorphism, then $d_{G'}(\phi(u), \phi(v)) \le d_G(u,v)$ for every
u and v. Next we define the notion of distance maintaining epimorphism.



Definition 4. Let a be a positive constant. An epimorphism $\phi$ from the graph
G = (V,E) onto the graph G' = (V',E') is called a-distance maintaining if for
all $u,v \in V$, $d_G(u,v) \le a \cdot d_{G'}(\phi(u), \phi(v))$. The epimorphism $\phi$ is called distance
maintaining if it is a-distance maintaining for some positive constant a.
It is not hard to see that if p is a probabilistic mapping on the vertices of G
then p' is a probabilistic mapping on the vertices of G', where

    $p'(u',v') = \frac{1}{|\phi^{-1}(u')|} \sum_{u \in \phi^{-1}(u')} \sum_{v \in \phi^{-1}(v')} p(u,v)$.    (8)

Lemma 10. Assume that there is an a-distance maintaining epimorphism $\phi$
from G onto G'. Let (G,p) be an (f,ac)-Long Range Contact graph. Then (G',p')
is an (f',c)-Long Range Contact graph, where p' is defined in Equation (8) and
$f'(d) = f(ad) \cdot \max_{u' \in V'} |\phi^{-1}(u')|$.



Proof. Let $\phi$ be a distance maintaining epimorphism from G onto G'. First of
all observe that for any t, t' such that $\phi(t) = t'$ we have that

    $B^{G'}_{d'}(t') = \{v' : d_{G'}(v',t') \le d'\} = \{\phi(v) : d_{G'}(\phi(v), \phi(t)) \le d'\}$.

Therefore, $B^{G'}_{d'/c}(t') \supseteq \phi(\{v : d_G(v,t) \le d'/c\})$ from the definition of
epimorphism. Hence $B^{G'}_{d'/c}(t') \supseteq \phi(B^{G}_{d'/c}(t))$. It follows that

    $B^{G'}_{d'/c}(t') \supseteq \bigcup_{t \in \phi^{-1}(t')} \phi(B^{G}_{d'/c}(t))$.    (9)


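Let $u',t' \in V'$ be vertices such that $d_{G'}(u',t') \le d'$. From the definition of
epimorphism, there exist vertices $u_0,t_0 \in V$ such that $\phi(u_0) = u'$, $\phi(t_0) = t'$. Then
from the definition of distance maintaining, we have $d_G(u_0,t_0) \le a \cdot d_{G'}(u',t') \le
a \cdot d'$. We have

    $p'[u' \to B^{G'}_{d'/c}(t')] = \sum_{v' \in B^{G'}_{d'/c}(t')} p'(u',v')$
        $= \frac{1}{|\phi^{-1}(u')|} \sum_{v' \in B^{G'}_{d'/c}(t')} \sum_{u \in \phi^{-1}(u')} \sum_{v \in \phi^{-1}(v')} p(u,v)$
        $\ge \frac{1}{|\phi^{-1}(u')|} \sum_{v' \in B^{G'}_{d'/c}(t')} \sum_{v \in \phi^{-1}(v')} p(u_0,v)$
        $= \frac{1}{|\phi^{-1}(u')|} \sum_{v \in \phi^{-1}(B^{G'}_{d'/c}(t'))} p(u_0,v)$.

Therefore, from Inequality (9) we get

    $p'[u' \to B^{G'}_{d'/c}(t')] \ge \frac{1}{|\phi^{-1}(u')|} \sum_{v \in \bigcup_{t \in \phi^{-1}(t')} B^{G}_{d'/c}(t)} p(u_0,v)$.

It follows that

    $p'[u' \to B^{G'}_{d'/c}(t')] \ge \frac{1}{|\phi^{-1}(u')|} \, p[u_0 \to \bigcup_{t \in \phi^{-1}(t')} B^{G}_{d'/c}(t)]$
        $\ge \frac{1}{|\phi^{-1}(u')|} \, p[u_0 \to B^{G}_{(ad')/(ac)}(t_0)]$
        $\ge \frac{1}{|\phi^{-1}(u')|} \cdot \frac{1}{f(ad')}$
        $\ge 1/\bigl(f(ad') \cdot \max_{u' \in V'} |\phi^{-1}(u')|\bigr)$
        $\ge 1/f'(d')$.

This completes the proof of the Lemma.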

Lemma 10 enables us to define new distributions on graphs.

Theorem 6. Let G = (V,E) be any graph such that there is a distance
maintaining epimorphism $\phi$ from a k-dimensional torus of size $\Theta(n)$ onto G. Further
assume that $\max_{u \in V} |\phi^{-1}(u)| = O(1)$. Then there is a probabilistic mapping p on
G such that greedy routing in (G,p) performs in $O(3^k \log^2 n)$ expected number
of steps.

Proof. From Lemma 9, $(T_q^k,p_k)$ is an (f,2)-Long Range Contact graph, where
$f(d) = O(3^k \log n)$. By application of Lemma 10, the probability p' defined in
Equation (8) is such that (G,p') is an (f',2)-Long Range Contact graph where
$f'(d) \le \beta \cdot f(ad)$ for some constants a and $\beta$. That is, $f'(d) = O(3^k \log n)$. It
follows from Lemma 4 that greedy routing in (G,p') performs in $O(3^k \log^2 n)$
expected number of steps.



6 Conclusion and Open Problems 

In this paper we have studied the performance of greedy routing in the ring
augmented with long range contacts chosen using r-harmonic distributions. We
have also shown how to extend our results to arbitrary networks via appropriate
mappings of multidimensional tori onto the network. Under certain conditions it
is shown that greedy routing performs quite efficiently, i.e., in $O(\log^2 n)$ expected
number of steps. In particular, the ring augmented with the 1-harmonic
distribution provides a simple model for the small world phenomenon.

Several interesting problems remain. For a general network, can we define
probabilistic mappings for which greedy routing has better performance? Is our
$\Omega(\log^2 n)$ lower bound on the ring valid for all distance invariant mappings (not
just the r-harmonic) on the n-node ring? Similar questions apply to any
multidimensional torus. We note that in this paper we emphasized greedy routing,
in the sense that nodes forward messages to their neighbors which are closer to
the destination. An interesting open problem is to study the resulting tradeoff
between memory (required at the nodes of the network) and the type of routing
being used.
Abstract. We investigate the problem of communication in an ad-hoc
mobile network, that is, we assume the extreme case of a total absence of
any fixed network infrastructure (for example a case of rapid deployment
of a set of mobile hosts in an unknown terrain). We propose, in such a
case, that a small subset of the deployed hosts (which we call the support)
should be used for network operations. However, the vast majority of the
hosts are moving arbitrarily according to application needs.

We then provide a simple, correct and efficient protocol for communication
that avoids message flooding. Our protocol manages to establish
communication between any pair of mobile hosts in small, a-priori guaranteed
expected time bounds even in the worst case of arbitrary motions
of the hosts that are not in the support (provided that they do not
deliberately try to avoid the support). These time bounds, interestingly, do not
depend on the number of mobile hosts that do not belong to the support.
They depend only on the size of the area of motions. Our protocol can be
implemented in very efficient ways by exploiting knowledge of the space
of motions or by adding more power to the hosts of the support.

Our results exploit and further develop some fundamental properties of 
random walks in finite graphs. 



1 Introduction 

Ad-hoc Mobile Networks: An ad-hoc mobile network is a collection
of mobile hosts with wireless network interfaces forming a temporary network
without the aid of any established infrastructure or centralised administration.
In an ad-hoc network two hosts that want to communicate may not be within
wireless transmission range of each other, but could communicate if other hosts
between them in the ad-hoc network are willing to forward packets for them.

A basic communication problem, in such networks, is to send information from
some sender user, S, to another designated receiver user, R. Note that ad-hoc
mobile networks are dynamic in nature, in the sense that local connections are
temporary and may change as users move. The movement rate of each user might
vary, while certain hosts might even stop (even in “remote” areas) in order to
execute location-oriented tasks (e.g. take measurements).

A protocol solving this important communication problem is reliable if it 
allows the sender to be notified about delivery of the information to the receiver. 

The innovation and justification of our approach: One way to solve 
this problem is the protocol of notifying every user that the sender meets (and 
providing all the information to it) hoping that some of them will eventually 
meet the receiver. 

Is there a more efficient technique that will effectively solve the com- 
munication problem without flooding the network and exhausting the 

battery and computational power of the hosts? 

The most common way to establish communication is to form paths of
intermediate nodes that lie within one another’s transmission range and can directly
communicate with each other. Indeed, this approach of exploiting
pairwise communication is common in ad-hoc mobile networks that cover a
relatively small space (i.e. with diameter which is small with respect to transmission
range) or are dense (i.e. thousands of wireless nodes), where all locations are
occupied by some hosts and broadcasting can be efficiently accomplished.

In wider area ad-hoc networks with fewer users, however, broadcasting is
impractical: two distant peers will not be reached by any broadcast as users may
not occupy all intermediate locations (i.e. the formation of a path is not
feasible). Even if a valid path is established, single link “failures”, happening when a
small number of users that were part of the communication path move in a way
such that they are no longer within transmission range of each other, will make
this path invalid. Note also that the path established in this way may be very
long, even in the case of connecting nearby hosts.

In contrast to all such methods, we try to avoid ideas based on path finding
and path maintenance. We envision networks with highly dynamic movement of
the mobile users, where the idea of “maintenance” of a valid path is
inconceivable (paths can become invalid immediately after they have been added to the
directory tables). Our approach is to take advantage of the mobile hosts’ natural
movement by exchanging information whenever mobile hosts meet incidentally.
It is evident, however, that if the users are spread in remote areas and they do
not move beyond these areas, there is no way for information to reach them,
unless the protocol takes special care of such situations.

In the light of the above, we propose the idea of forcing only a small subset
of the deployed hosts to move as per the needs of the protocol. Assuming the
availability of such hosts, we use them to provide a simple, correct and efficient
strategy for communication between any pair of hosts in such networks that
avoids message flooding.

A scenario for rapid deployment of mobile hosts: A usual scenario
that fits the ad-hoc mobile model is the particular case of rapid deployment of
mobile hosts in an area where there is no underlying fixed infrastructure (either
because it is impossible or very expensive to create such an infrastructure, or
because it is not established yet, or because it has become temporarily unavailable,
i.e. destroyed or down).

In such a case of rapid deployment of a number of mobile hosts, it is possible
to have a small team of fast moving and versatile vehicles to implement the
support. These vehicles can be cars, jeeps, motorcycles or helicopters.
Interestingly, this small team of fast moving vehicles can also be a collection
of independently controlled mobile modules, i.e. robots. This specific approach is
inspired by the recent paper of J. Walter, J. Welch and N. Amato. In their paper
“Distributed Reconfiguration of Metamorphic Robot Chains” [...], the authors
study the problem of motion co-ordination in distributed systems consisting of
such robots, which can connect, disconnect and move around. The paper deals
with metamorphic systems where (as is also the case in our approach) all
modules are identical. Note that the approach of having the support moving in a
co-ordinated way, i.e. as a chain of nodes, has some similarities to [...].

Our results: We provide a particular protocol (and a specific support coor- 
dination subprotocol) which guarantees correct and efficient communication for 
any pair of users, in (expected) time depending only on the size of the network 
area, independently of the motion of the hosts not in the support and indepen- 
dently of their number. We achieve this by assuming that a small part of the 
deployed hosts, which we call the support, can move fast in a coordinated way, 
to sweep the motion space and act as an intermediate pool for receiving and 
delivering messages to the mobile users. 

In a way similar to [...], these moving modules are identical in computing and
communication (i.e. transmission) capability and run the same support
management subprotocol to determine movement and communication of the hosts in
the support. Furthermore, note that each module in the support needs only to know
its current location (i.e. only local information is needed and not a global picture
of the entire area). However, additional global information (such as knowledge
of a spanning subgraph of the motion space) can improve the performance of
our protocol.

Our protocol is simple, scalable, does not assume common sense of orienta- 
tion, and does not need a lot of memory. It is resilient to single-host failures 
of the support. Furthermore, our protocol avoids the problem of flooding the 
network with messages. 

The proof of our main theorem exploits the fundamental notion of strong 
stationary times of reversible Markov Chains. This notion allows us to consider 
general motion strategies of the users not in the support. 

In [...] we performed extensive experiments (and some analysis) on a version
of such a strategy, but without the general framework and only for the restricted
case where all users (even those not in the support) perform independent and
concurrent random walks. A model for motion (without geometry details) for
mobile networks was introduced by members of our team in [...]. Related material
has appeared as a brief announcement in the Proceedings of the 20th Annual
Symposium on Principles of Distributed Computing. For a survey of selected
work in distributed communication and control issues in ad-hoc mobile networks,
see [...].

Previous Work: In a recent paper [...], Q. Li and D. Rus present a model
which has some similarities to ours. The authors give an interesting, yet different,
protocol to send messages, which forces all the mobile hosts to slightly deviate
(for a short period of time) from their predefined, deterministic routes, in order
to propagate the messages. Their protocol is, thus, compulsory for any host and
it works only for deterministic host routes. Moreover, their protocol considers
the propagation of only one message (end to end) each time, in order to be
correct. In contrast, our support scheme allows for simultaneous processing
of many communication pairs. In their setting, the authors show optimality of
message transmission times.

M. Adler and C. Scheideler [...], in a previous work, dealt only with static
transmission graphs, i.e. the situation where the positions of the mobile hosts and
the environment do not change. In [...] the authors pointed out that static graphs
provide a starting point for the dynamic case. In our work, we consider the
dynamic case (i.e. mobile hosts move arbitrarily) and in this sense we extend
their work. As far as performance is concerned, their work provides time bounds
for communication that are proportional to the diameter of the graph defined
by random uniform spreading of the hosts, while our time bounds are linear in
the area of motion and independent of the number of mobile hosts and of their
spreading.

We quantify our protocol’s performance (in terms of communication time) 
and we show how to make it efficient and how to estimate the best size of the 
support. 



2 The Model of the Space of Motions 

Based on the work of [...] we abstract the environment where the stations move
(in three-dimensional space with possible obstacles) by a motion graph (i.e. we
neglect the detailed geometric characteristics of the motion). In particular, we
first assume that each mobile host has a transmission range represented by a
sphere $tr$ centred on itself. We approximate this sphere by a cube $tc$ with volume
$V(tc)$, the maximum such that $V(tc) < V(tr)$. Given that the mobile hosts are
moving in the space $S$, $S$ is divided into consecutive cubes of volume $V(tc)$.

Definition 1. The motion graph $G(V,E)$, ($|V| = n$, $|E| = m$), which
corresponds to a quantization of $S$, is constructed in the following way: a vertex
$u \in G$ represents a cube of volume $V(tc)$. An edge $(u,v) \in G$ exists if the
corresponding cubes are adjacent.

The number of vertices $n$ approximates the ratio between the volume of
space $S$, $V(S)$, and the space occupied by the transmission range of a
mobile host, $V(tr)$. Given the transmission range $tr$, $n$ depends linearly on the
volume of space $S$ regardless of the choice of $tc$, and
$n = O\left(\frac{V(S)}{V(tr)}\right)$. Let us call the ratio $\frac{V(S)}{V(tr)}$ the
relative motion space size and denote it by $\rho$. Since the edges of $G$ represent
neighbouring polyhedra, each node is connected with a constant number of
neighbours, which yields that $m = \Theta(n)$. Let $\Delta$ be the maximum vertex
degree of $G$.
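
As an illustration of the construction above, the following Python sketch builds a motion graph for a box-shaped space. It is only a sketch under stated assumptions: the space is an axis-aligned box already divided into unit cubes, two cubes count as adjacent when they share a face, and the names `build_motion_graph`, `graph_stats` and the `blocked` parameter (cubes taken up by obstacles) are hypothetical and do not come from the paper.

```python
from itertools import product

# Offsets to the six face-adjacent neighbours of a cube (an assumed notion of adjacency).
FACE_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def build_motion_graph(dims, blocked=frozenset()):
    """Return an adjacency dict for an nx x ny x nz box of cubes.

    Each free cube is a vertex; an edge joins two free cubes that share a face.
    `blocked` holds cubes occupied by obstacles (hypothetical parameter).
    """
    nx, ny, nz = dims
    cells = {c for c in product(range(nx), range(ny), range(nz)) if c not in blocked}
    adj = {c: [] for c in sorted(cells)}
    for (x, y, z) in adj:
        for dx, dy, dz in FACE_OFFSETS:
            nb = (x + dx, y + dy, z + dz)
            if nb in cells:
                adj[(x, y, z)].append(nb)
    return adj


def graph_stats(adj):
    """Return (n, m, max_degree) for an undirected adjacency dict."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is stored twice
    max_degree = max(len(nbrs) for nbrs in adj.values())
    return n, m, max_degree


if __name__ == "__main__":
    print(graph_stats(build_motion_graph((3, 3, 3))))
```

For a 3 x 3 x 3 box this gives n = 27 vertices, m = 54 edges and maximum degree 6, which matches the observation that every vertex has a constant number of neighbours.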



3 A Protocol Framework for Ad-hoc Mobile Networks 

We wish to look into ad-hoc networks where a small part of their hosts is used 
to serve network needs for communication. This is captured by the following: 

Definition 2. The class of ad-hoc mobile network protocols which enforce a 
(small) subset of the mobile hosts to move in a certain way is called the class of 
semi-compulsory protocols. 



Definition 3. The subset of the mobile hosts of an ad-hoc mobile network whose
motion is determined by a network protocol $\mathcal{P}$ is called the support $\Sigma$
of $\mathcal{P}$. The part of $\mathcal{P}$ which indicates the way that members of
$\Sigma$ move and communicate is called the support management subprotocol of
$\mathcal{P}$.



Definition 4. Consider a family of protocols, $\mathcal{F}$, for a mobile ad-hoc network,
and let each $\mathcal{P}$ in $\mathcal{F}$ have the same support (and the same support
management subprotocol). Then $\Sigma$ is called the support of the family $\mathcal{F}$.

In addition, we may wish that the way hosts in $\Sigma$ move (maybe coordinated)
and communicate is robust (i.e. can tolerate failures of hosts).

The types of failures of hosts that we consider here are permanent (i.e. stop) 
failures. 

Definition 5. A support management subprotocol, $M_\Sigma$, is $k$-faults tolerant if
it still allows the members of $\mathcal{F}$ (or $\mathcal{P}$) to execute correctly, under
the presence of at most $k$ permanent faults of hosts in $\Sigma$ ($k \ge 1$).

We assume that the motions of the mobile users which are not members of
$\Sigma$ are arbitrary but independent of the motion of the support (i.e. we exclude
the case where some of the users not in $\Sigma$ are deliberately trying to avoid
$\Sigma$). This is a pragmatic assumption usually followed by application protocols.
We call it the independence assumption.

Definition 6. An ad-hoc mobile network is not hostile if the hosts not in $\Sigma$ obey
the independence assumption.



4 Our Proposed Strategy

4.1 The Scheme 

Our proposed scheme, in simple terms, works as follows: The nodes of the support
move fast enough in a coordinated way so that they sweep (in sufficiently short
time) the entire motion graph. Their motion and communication is accomplished
in a distributed way via a support management subprotocol $M_\Sigma$. When some node
of the support is within communication range of a sender, an underlying sensor
subprotocol $M'_\Sigma$ notifies the sender that it may send its message(s).

The messages are then stored “somewhere within the support structure”.
For simplicity we may assume that they are copied and stored in every node of
the support. This is not the most efficient storage scheme and can be refined
in various ways. When a receiver comes within communication range of a node
of the support, the receiver is notified that a message is “waiting” for it and
the message is then forwarded to the receiver. For simplicity, we will also
assume that message exchange between nodes within communication distance of
each other takes negligible time. Note that this general scheme allows for easy
implementation of many-to-one communication and also multicasting. In a way,
the support $\Sigma$ plays the role of a (moving) skeleton subnetwork (of a “fixed”
structure, guaranteed by the motion subprotocol $M_\Sigma$), through which all
communication is routed. From the above description, the size, $k$, and the shape
of the support may affect performance.

Our scheme follows the general design principle of mobile networks (with
a fixed subnetwork, however) called the “two-tier” principle ([...]), which says
that any protocol should try to move communication and computation to the
fixed part of the network. Our idea of the support $\Sigma$ is, however, a simulation
of such a (skeleton) network by moving hosts.

Note that the proposed scheme does not require the propagation of messages
through hosts that are not part of $\Sigma$; thus its security relies on the support's
security and is not compromised by the participation of other mobile users in
message communication. For a discussion of intrusion detection mechanisms for
ad-hoc mobile networks see [...].

4.2 The Implementation Proposed for $\Sigma$, $M_\Sigma$

There is a set-up phase of the ad-hoc network, where a predefined set of $k$
hosts become the nodes of the support. The members of the mobile support
perform a leader election by running a randomized symmetry breaking protocol
in anonymous networks ([...]). This imposes only an initial communication cost.
The elected leader, denoted by $MS_0$, is used to co-ordinate the support
topology and movement. Additionally, the leader assigns local names to the rest of
the support members $(MS_1, MS_2, \ldots, MS_{k-1})$. The movement of $\Sigma$ is then
defined as follows:

Initially, $MS_i$, $\forall i \in \{0, 1, \ldots, k-1\}$, start from the same area-node
of the motion graph. The direction of movement of the leader $MS_0$ is
given by a memoryless operation that chooses randomly the direction
of the next move. Before leaving the current area-node, $MS_0$ sends a
message to $MS_1$ that states the new direction of movement. $MS_1$ will
change its direction as per the instructions of $MS_0$ and will propagate the
message to $MS_2$. In analogy, $MS_i$ will follow the orders of $MS_{i-1}$ after
transmitting the new directions to $MS_{i+1}$. Movement orders received by
$MS_i$ are positioned in a queue $Q_i$ for sequential processing. The very
first move of $MS_i$, $\forall i \in \{1, 2, \ldots, k-1\}$, is delayed by a $\delta$ period of time.

We assume that the mobile support hosts move with a common speed. Note
that the above described motion subprotocol $M_\Sigma$ forces the support to move
as a “snake”, with the head (the elected leader $MS_0$) doing a random walk on
the motion graph $G$ and each of the other nodes $MS_i$ executing the simple
protocol “move where $MS_{i-1}$ was before”. Therefore our protocol does not require
a common sense of orientation.
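
The “snake” motion described above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the authors' implementation: time is discretized into unit steps, the initial delay of each follower is taken to be one step per position in the snake, the per-host order queues are collapsed into a simple shift of positions, and the function name `snake_walk` and its parameters are hypothetical.

```python
import random


def snake_walk(adj, start, k, steps, seed=0):
    """Simulate a k-host "snake" on the graph `adj` for `steps` moves.

    snake[0] is the head (the elected leader); snake[i] is the i-th follower.
    Returns the list of snake configurations, one per time step, including
    the initial one.
    """
    rng = random.Random(seed)
    snake = [start] * k  # all hosts start on the same area-node
    history = [list(snake)]
    for _ in range(steps):
        # Memoryless choice of the head's next direction.
        new_head = rng.choice(adj[snake[0]])
        # Every follower moves to where its predecessor was one step ago,
        # so follower i only starts moving after i steps (its initial delay).
        snake = [new_head] + snake[:-1]
        history.append(list(snake))
    return history


if __name__ == "__main__":
    ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
    for configuration in snake_walk(ring, start=0, k=3, steps=5):
        print(configuration)
```

In this sketch follower i at time t simply occupies the node that the head occupied at time t - i, which is why no host needs any knowledge of orientation or topology beyond the order it receives.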

The purpose of the random walk of the head is to ensure a cover (within 
some finite time) of the whole motion graph, without memory (other than local) 
of topology details. Note that this memoryless motion also ensures fairness. 

A modification of $M_\Sigma$ is that the head does a random walk on a spanning
subgraph of $G$ (e.g. a spanning tree). This modified $M_\Sigma$ is more
efficient in our setting since “edges” of $G$ just represent adjacent locations and
“nodes” are really possible host places.

4.3 Alternative Implementations - Extensions 

One can also think of other ways to implement the support management
subprotocol $M_\Sigma$.

- The runners implementation of $M_\Sigma$ allows each member of $\Sigma$ to move via an
independent random walk (on the same spanning subgraph of $G$). When runners
meet, they exchange information given to them by hosts. This management
subprotocol provides improved reliability in the sense that it is resilient to $t$
faults, where $t < k$. However, note that messages may have to be re-transmitted
in the case that only one copy of them exists when the faults occur.

The key observation justifying this approach (and maybe its superiority, with
respect to performance, compared to the “snake” approach) is that the runners
meet each other in parallel, thus accelerating the spread of information.
In [...] we experimentally showed that the “runners” protocol outperforms the
“snake” protocol.
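
The information exchange between runners can be illustrated as follows. This is a hedged sketch, not the protocol evaluated in the experiments mentioned above: it assumes synchronous rounds, treats two runners as “meeting” only when they stand on the same node, and uses hypothetical names (`runners_round`, `simulate_runners`) and plain Python sets as message stores.

```python
import random


def runners_round(positions, stores, adj, rng):
    """One synchronous round: every runner moves, then co-located runners sync.

    positions[r] is the node of runner r; stores[r] is its set of messages.
    """
    positions = [rng.choice(adj[p]) for p in positions]
    by_node = {}
    for runner, node in enumerate(positions):
        by_node.setdefault(node, []).append(runner)
    for group in by_node.values():
        # Runners that meet end up holding the union of their messages.
        merged = set().union(*(stores[r] for r in group))
        for r in group:
            stores[r] = set(merged)
    return positions, stores


def simulate_runners(adj, starts, stores, rounds, seed=0):
    """Run `rounds` rounds and return the final positions and stores."""
    rng = random.Random(seed)
    positions = list(starts)
    stores = [set(s) for s in stores]
    for _ in range(rounds):
        positions, stores = runners_round(positions, stores, adj, rng)
    return positions, stores


if __name__ == "__main__":
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print(simulate_runners(ring, [0, 2, 4], [{"a"}, {"b"}, set()], rounds=30))
```

Because every meeting replicates messages, a message survives the failure of a runner as long as at least one other runner has already received a copy, which is the situation the remark about re-transmission refers to.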

- In hierarchical motion graphs [...] we can divide $\Sigma$ into a subset $\Sigma'$ moving
only in the upper level of the hierarchy and the hosts of $\Sigma - \Sigma'$, which can be
split into “snakes”, each randomly walking inside the lower levels of the hierarchy.
The lower level of the hierarchy may model dense ad-hoc subnetworks of mobile users
that are unstructured and where there is no fixed infrastructure. To implement
communication in such a case, a possible solution would be to install a very fast
(yet limited) backbone interconnecting such highly populated mobile user areas,
while using the support approach in the lower levels. This fast backbone provides
a limited number of access ports within these dense areas of mobile users.

In such hierarchical cases communication between users in different dense
areas takes place in the following way: The support first gets the messages from
the sender node upon meeting it and conveys these messages to the backbone
system when meeting the corresponding access port. Then the messages are
forwarded, exploiting the very fast communication over the backbone, to some
access port in the receiver area, from which the messages are subsequently
picked up by the local support ($\Sigma'_2$) and delivered to the receiver host.

We note that this hierarchical approach for a management subprotocol is 
inherently modular. 



4.4 Protocol Correctness Properties 

In the sequel we investigate non-hostile ad-hoc mobile networks. We assume 
that each mobile host has sufficient power supplies (or on-line power feedings) 
to support communication for long times. Moreover, we assume (to simplify the 
technical analysis) common speed and fixed transmission range for the hosts not 
in the support. 

In the sequel, we assume that the head of $\Sigma$ does a continuous time random
walk on $G(V,E)$, without loss of generality (we can discretize). We define the
random walk of a mobile user on $G$ that induces a continuous time Markov chain
$M_G$ as follows: The states of $M_G$ are the vertices of $G$. Let $s_t$ denote the state of
$M_G$ at time $t$. Given that $s_t = u$, $u \in V$, the probability that $s_{t+dt} = v$, $v \in V$,
is $p(u,v) \cdot dt$ where

\[ p(u,v) = \begin{cases} \frac{1}{d(u)} & \text{if } (u,v) \in E \\ 0 & \text{otherwise} \end{cases} \]

and $d(u)$ is the degree of vertex $u$.



Definition 7. $P_i(E)$ is the probability that the walk satisfies an event $E$ given
it started at vertex $i$.

Definition 8. For a vertex $j$, let $T_j$ be the first hitting time of the walk onto that
vertex and let $E_i T_j$ be its expected value, given that the walk started at vertex $i$
of $G$.

Definition 9. For the walk of $\Sigma$'s head, let $\pi()$ be the stationary distribution of
its position after a sufficiently long time.

We know (see [...]) that for every vertex $\sigma$, $\pi(\sigma) = \frac{d(\sigma)}{2m}$ where
$d(\sigma)$ is the degree of $\sigma$ in $G$ and $m = |E|$.
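
The degree-proportional form of the stationary distribution can be checked directly on small graphs. The sketch below is an illustration only: it works with the discrete-time jump chain that moves to a uniformly chosen neighbour (assumed here to have the same stationary distribution as the unit-rate continuous-time walk), it uses exact rational arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction


def stationary_distribution(adj):
    """Return pi(v) = d(v) / (2m) for every vertex of an undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2m
    return {v: Fraction(len(nbrs), two_m) for v, nbrs in adj.items()}


def is_stationary(adj, pi):
    """Check that pi(v) = sum over neighbours u of pi(u) * p(u, v), with p(u, v) = 1/d(u)."""
    for v in adj:
        inflow = sum(pi[u] * Fraction(1, len(adj[u])) for u in adj[v])
        if inflow != pi[v]:
            return False
    return True


if __name__ == "__main__":
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    pi = stationary_distribution(star)
    print(pi, is_stationary(star, pi))
```

On the star graph used in the example the centre receives probability 1/2 and each leaf 1/6, whereas the uniform distribution fails the stationarity check.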

Definition 10. Let $p_{j,k}$ be the transition probability of the walk of $\Sigma$'s head
from vertex $j$ to vertex $k$. Let $p_{j,k}(t)$ be the probability that the walk started at $j$
will be at $k \in V$ in time $t$.






Theorem 1. The support $\Sigma$ and the management subprotocol $M_\Sigma$ guarantee
reliable communication establishment between any sender-receiver $(S,R)$ pair in
finite time, whose expected value is bounded only by a function of the relative
motion space size $\rho$ and does not depend on the number of hosts, and is also
independent of how $S$, $R$ move.

Proof. Any sender $S$ or receiver $R$ is allowed an arbitrary strategy of motion,
but it does not deliberately try to avoid the support $\Sigma$. So, it either executes a
deterministic motion (which either stops at a node, or repeats forever) or follows
a random strategy independent of the random walk of the support's head.

For the purposes of the proof, it is enough to show that the head of $\Sigma$ will meet
$S$ and $R$ infinitely often, with probability 1 (in fact our argument is a consequence
of the Borel-Cantelli Lemmas for infinite sequences of trials). We will furthermore
show that the first meeting time $M$ (with $S$ or $R$) has an expected value (where
expectation is taken over the walk of $\Sigma$ and any strategy of $S$ (or $R$) and any
starting position of $S$ (or $R$) and $\Sigma$) which is bounded by a function of the size
of the motion graph $G$ only. This then proves the Theorem since it shows that $S$
(and $R$) meet with the head of $\Sigma$ infinitely often, each time within a bounded
expected duration.

So, let $EM$ be the expected time of the (first) meeting and $m^* = \sup EM$,
where the supremum is taken over all starting positions of both $\Sigma$ and $S$ (or $R$)
and all strategies of $S$ (one can repeat the argument with $R$).

We will now assume w.l.o.g. (see [...]) that the head of $\Sigma$'s walk is a
continuous-time random walk on $G$. The states of the walk of $\Sigma$'s head are just the
vertices of $G$ and they are finite.

Definition 11. Let $X(t)$ be the position of the walk at time $t$.

We proceed to show that we can construct for the walk of $\Sigma$'s head a strong
stationary time sequence $V_i$ such that for all $\sigma \in V$ and for all times $t$

\[ P_i(X(V_i) = \sigma \mid V_i = t) = \pi(\sigma). \]

Notice that at times $V_i$, $S$ (or $R$) will necessarily be at some vertex $\sigma$ of $V$,
either still moving or stopped. Let $u$ be a time such that for $X$,

\[ p_{j,k}(u) \ge \left(1 - \frac{1}{e}\right) \pi(k) \]

for all $j,k$. Such a $u$ always exists because $p_{j,k}(t)$ converges to $\pi(k)$ from basic
Markov Chain Theory. Note that $u$ depends only on the structure of the walk's
graph, $G$. In fact, if one defines separation from stationarity to be

\[ s(t) = \max_j s_j(t) \]

where

\[ s_j(t) = \sup\{ s : p_{ij}(t) \ge (1 - s)\pi_j \} \]

then

\[ u = \min\{ t : s(t) \le e^{-1} \} \]

is called the separation threshold time. For general graphs $G$ of $n$ vertices this
quantity is known to be $O(n^3)$ ([...]).

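Now consider a sequence of stopping times $U_i \in \{u, 2u, 3u, \ldots\}$ such that

\[ P_i(X(U_i) = \sigma \mid U_i = u) = \left(1 - \frac{1}{e}\right)\pi(\sigma) \qquad (1) \]

for any $\sigma \in V$. By induction on $\lambda \ge 1$ then

\[ P_i(X(U_i) = \sigma \mid U_i = \lambda u) = e^{-(\lambda-1)}\left(1 - \frac{1}{e}\right)\pi(\sigma). \]

This is because of the following: First remark that for $\lambda = 1$ we get the
definition of $U_i$. Assume that the relation holds for $(\lambda - 1)$, i.e.

\[ P_i(X(U_i) = \alpha \mid U_i = (\lambda-1)u) = e^{-(\lambda-2)}\left(1 - \frac{1}{e}\right)\pi(\alpha) \]

for any $\alpha \in V$. Then $\forall \sigma \in V$

\[ P_i(X(U_i) = \sigma \mid U_i = \lambda u) = \sum_{\alpha \in V} P_i(X(U_i) = \alpha \mid U_i = (\lambda-1)u) \cdot p_{\alpha,\sigma}(u) \]

\[ = e^{-(\lambda-2)}\left(1 - \frac{1}{e}\right) \sum_{\alpha \in V} \pi(\alpha) \cdot \frac{1}{e}\,\pi(\sigma) \quad \text{from (1)} \]

\[ = e^{-(\lambda-1)}\left(1 - \frac{1}{e}\right)\pi(\sigma) \]

which ends the induction step. Then, for all $\sigma$

\[ P_i(X(U_i) = \sigma) = \pi(\sigma) \qquad (2) \]

and

\[ E_i U_i = u\left(1 - \frac{1}{e}\right)^{-1}. \]

Now let $c = \frac{e}{e-1}u$. So, we have constructed (by (2)) a strong stationary
time sequence $U_i$ with $EU_i = c$. Consider the sequence $0 = U_0 < U_1 < U_2 < \ldots$
such that for $i \ge 0$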



\[ E(U_{i+1} - U_i \mid U_j, j \le i) \le c. \]

But, from our construction, the positions $X(U_i)$ are independent (of the
distribution $\pi()$), and, in particular, $X(U_i)$ are independent of $U_i$. Therefore,
regardless of the strategy of $S$ (or $R$) and because of the independence assumption,
the support's head has chance at least $\min_\sigma \pi(\sigma)$ to meet $S$ (or $R$) at time $U_i$,
independently as $i$ varies. So, the meeting time $M$ satisfies $M \le U_T$ where $T$ is
a stopping time with mean

\[ \frac{1}{\min_\sigma \pi(\sigma)}. \]

Note that the idea of a stopping time $T$ such that $X(T)$ has distribution $\pi$
and is independent of the starting position is central to the standard modern
Theory of Harris-recurrent Markov Chains (see e.g. [...]).

From Wald's inequality ([...]) then $EU_T \le c \cdot ET$, thus

\[ m^* \le c \cdot \frac{1}{\min_\sigma \pi(\sigma)}. \]

Note that since $G$ is produced as a subgraph of a regular graph of fixed degree
$\Delta$ we have

\[ \frac{1}{2m} \le \pi(\sigma) \le \frac{1}{n} \]

for all $\sigma$ ($n = |V|$, $m = |E|$), thus $ET \le 2m$, hence

\[ m^* \le 2mc = \frac{e}{e-1}\,2mu. \]

Since $m$, $u$ only depend on $G$, this proves the Theorem. □



Corollary 1. If $\Sigma$'s head walks randomly in a regular spanning subgraph of $G$,
then $m^* \le 2cn$.

Now, we examine the robustness of the motion management subprotocol
under single stop-faults.

Theorem 2. The support management subprotocol $M_\Sigma$ is 1-fault tolerant.

Proof. If a single host of $\Sigma$ fails, then the next host becomes the head of
the rest of the “snake”. We thus have two independent random walks in $G$ (of
the two “snakes”) which, however, will meet in expected time at most $m^*$ (as in
Theorem 1) and re-organize as a single snake via a very simple re-organization
protocol which is the following:

When the head of the second snake $\Sigma_2$ meets a host $h$ of the first snake $\Sigma_1$,
then the head of $\Sigma_2$ follows the host which is “in front” of $h$ in $\Sigma_1$, and all the
part of $\Sigma_1$ after and including $h$ waits, to follow $\Sigma_2$'s tail. □

Note that in the case when more than one fault occurs, the procedure for
merging “snakes” described above may lead to deadlock, as Fig. 1 graphically
depicts.

Fig. 1. Deadlock situation arising when four “snakes” are about to merge.



4.5 Protocol Time Efficiency Properties 

Crude bounds. Clearly, one intuitively expects that if $k = |\Sigma|$ then the higher
$k$ is (with respect to $n$), the better the performance of $\Sigma$ gets.

By working as in the proof of Theorem 1, we can create a sequence of strong
stationary times $U_i$ such that $X(U_i) \in F$ where $F = \{\sigma : \sigma$ is a position of a host
in the support$\}$. Then $\pi(\sigma)$ is replaced by $\pi(F)$, which is just $\pi(F) = \sum \pi(\sigma)$
over all $\sigma \in F$. So now $m^*$ is bounded as follows:

\[ m^* \le c \cdot \frac{1}{\min_{J} \left( \sum_{\sigma \in J} \pi(\sigma) \right)} \]

where $J$ is any induced subgraph of the graph of the walk of $\Sigma$'s head such
that $J$ is the neighbourhood of a vertex $\sigma$ of radius (maximum distance simple
path) at most $k$. The quantity

\[ \min_{J} \left( \sum_{\sigma \in J} \pi(\sigma) \right) \]

is then at least $\frac{k}{2m}$ and, hence, $m^* \le c\,\frac{2m}{k}$.

Since the communication establishment time, $T_{total}$, between $S$, $R$ is bounded
above by $X + Y + Z$, where $X$ is the time for $S$ to meet $\Sigma$, $Y$ is the time for $R$ to
meet $\Sigma$ (after $X$) and $Z$ is the message propagation time in $\Sigma$, we have for all
$S$, $R$

\[ E(T_{total}) \le \frac{2mc}{k} + \Theta(k) + \frac{2mc}{k} \]

(since $Z = \Theta(k)$). The upper bound achieves a minimum when $k = \sqrt{2mc}$.
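
The choice of support size can be made explicit with a short calculation. This is a sketch that assumes the $\Theta(k)$ propagation term equals $a k$ exactly for some constant $a > 0$; the constant $a$ and the function $f$ are introduced here for illustration and are not named in the text.

```latex
f(k) = \frac{2mc}{k} + a\,k + \frac{2mc}{k} = \frac{4mc}{k} + a\,k,
\qquad
f'(k) = -\frac{4mc}{k^{2}} + a = 0
\;\Longrightarrow\;
k^{*} = 2\sqrt{\frac{mc}{a}},
\qquad
f(k^{*}) = 4\sqrt{a\,mc}.
```

For $a = 2$ this gives $k^{*} = \sqrt{2mc}$, the value quoted above, and for any constant $a$ the minimum is of order $\sqrt{mc}$.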



Lemma 1. For the walk of $\Sigma$'s head on the entire motion graph $G$, the expected
communication establishment time is bounded above by $\Theta(\sqrt{mc})$ when
the (optimal) support size $|\Sigma|$ is $\sqrt{2mc}$ and $c$ is $\frac{e}{e-1}u$, $u$ being the
“separation threshold time” of the random walk on $G$.

Tighter bounds - improved protocol. To make our protocol more efficient,
we now force the head of $\Sigma$ to perform a random walk on a regular spanning
subgraph of $G$. Let $G_R(V, E')$ be such a subgraph. Our improved protocol versions
assume that (a) such a subgraph exists in $G$ and (b) it is given in the beginning to
all the stations of the support. By studying, in a way similar to Theorem 1 and
[...], the first meeting times and the separation from stationarity of the random
walk on the regular spanning subgraph, we get the following theorem (for the proof
see [...]):

Theorem 3. By having $\Sigma$'s head move on a regular spanning subgraph of $G$,
there is an absolute constant $\gamma > 0$ such that the expected meeting time of $S$ (or
$R$) and $\Sigma$ is bounded above by $\gamma \frac{n^2}{k}$.

Remark again that the total expected communication establishment time is
bounded above by $2\gamma \frac{n^2}{k} + \Theta(k)$ and by choosing $k = \sqrt{2\gamma n^2}$ we can
get a best bound of $\Theta(n)$ for a support size of $\Theta(n)$.

Corollary 2. By forcing the support's head to move on a regular spanning
subgraph of the motion graph, our protocol guarantees a total expected
communication time of $\Theta(\rho)$, where $\rho$ is the relative motion space size, and this
time is independent of the total number of mobile hosts, and their movement.

Note also that our analysis assumed that the head of $\Sigma$ moves according to
a continuous time random walk of total rate 1 (rate of exit out of a node of $G$).
If we select the support's hosts to be $\psi$ times faster than the rest of the hosts,
all the estimated times, except the inter-support time, will be divided by $\psi$.
Thus

Corollary 3. Our modified protocol where the support is $\psi$ times faster than the
rest of the mobile hosts guarantees an expected total communication time which
can be made to be as small as [...], where $\gamma$ is an absolute constant.



5 A Lower Bound 



Lemma 2.

\[ m^* \ge \max_{i,j} E_i T_j \]

Proof. Consider the case where $S$ (or $R$) just stands still on some vertex $j$ and
$\Sigma$'s head starts at $i$. □

Corollary 4. When $\Sigma$ starts at positions according to the stationary
distribution $\pi$ of its head's walk then, $\forall j$,

\[ m^* \ge \max_j E_\pi T_j \]

□

From a Lemma of ([...], ch. 4, pp. 21), we know that for all $i$

\[ E_\pi T_i \ge \frac{(1 - \pi(i))^2}{q_i\,\pi(i)} \]

where $q_i = d_i$ is the degree of $i$ in $G$, i.e.,

\[ E_\pi T_j \ge \min_i \frac{\left(1 - \frac{d_i}{2m}\right)^2}{d_i \cdot \frac{d_i}{2m}}. \]

For regular spanning subgraphs of $G$ of degree $\Delta$ we have $m = \frac{\Delta n}{2}$, where
$d_i = \Delta$ for all $i$. Thus,

Theorem 4. When $\Sigma$'s head moves on a regular spanning subgraph of $G$, of $m$
edges, we have that the expected meeting time of $S$ (or $R$) and $\Sigma$ cannot be less
than

\[ \frac{(n-1)^2}{2m}. \]

Corollary 5. Since $m = \Theta(n)$ we get a $\Theta(n)$ lower bound for the expected
communication time. In that sense, our protocol's expected communication time
is optimal when the support size is $\Theta(n)$.
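
The lower bound can be compared with exact hitting times on a small regular graph. The following sketch is illustrative only: it assumes a discrete-time walk that takes one step per unit of time on a ring (a regular graph of degree 2), solves the standard linear system h(i) = 1 + average of h over the neighbours of i with h(target) = 0, and uses hypothetical function names.

```python
from fractions import Fraction


def hitting_times_to(adj, target):
    """Expected number of steps for a simple random walk to first reach `target`."""
    nodes = [v for v in adj if v != target]
    idx = {v: r for r, v in enumerate(nodes)}
    size = len(nodes)
    # Augmented matrix of the system (I - P restricted to non-target nodes) h = 1.
    a = [[Fraction(0)] * (size + 1) for _ in range(size)]
    for v in nodes:
        r = idx[v]
        a[r][r] += 1
        a[r][size] = Fraction(1)
        for u in adj[v]:
            if u != target:
                a[r][idx[u]] -= Fraction(1, len(adj[v]))
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(size):
        pivot = next(r for r in range(col, size) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [x / pv for x in a[col]]
        for r in range(size):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    h = {v: a[idx[v]][size] for v in nodes}
    h[target] = Fraction(0)
    return h


def stationary_hitting_time(adj, target):
    """Expected hitting time of `target` when the walk starts from pi(v) = d(v)/(2m)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    h = hitting_times_to(adj, target)
    return sum(Fraction(len(adj[v]), two_m) * h[v] for v in adj)


if __name__ == "__main__":
    n = 8
    ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    m = n  # a ring on n vertices has n edges
    print(stationary_hitting_time(ring, 0), Fraction((n - 1) ** 2, 2 * m))
```

On the ring with 8 vertices the exact value is 21/2, comfortably above the bound (n-1)^2 / (2m) = 49/16, as the theorem requires.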



6 Extensions of Our Work 

First of all we notice that our work does not assume any particular motion of
hosts not in $\Sigma$ (other than that we are in non-hostile networks). We pose as
an open problem the notion of “capture” of $S$ (or $R$) in hostile networks. We
also remark that any assumption on the motions of hosts not in $\Sigma$ will lead to much
better upper bounds on the communication time. We plan to investigate the case
of varying transmission ranges. We also pose as an open problem the proof of
correctness and the efficiency analysis of the proposed alternative
implementations, and especially the analytic comparison of the performance of the
“snake” and the “runners” approaches. Finally, it is interesting to comparatively
study the performance of our approach versus other routing protocols (such as
TORA, AODV, LAR) through experiments.



Abstract. We present a new non-blocking implementation of concurrent
linked-lists supporting linearizable insertion and deletion operations.
The new algorithm provides substantial benefits over previous schemes:
it is conceptually simpler and our prototype operates substantially faster.



1 Introduction 

It is becoming evident that non-blocking algorithms can deliver significant
benefits to parallel systems [MP91, LaM94, GC96, ABP98, Gre99]. Such algorithms use
low-level atomic primitives such as compare-and-swap. Through careful design
and by eschewing the use of locks it is possible to build systems which scale to
highly-parallel environments and which are resilient to scheduling decisions.

Linked-lists are one of the most basic data structures used in program design,
and so a simple and effective non-blocking linked-list implementation could serve
as the basis for many data structures. This paper presents a novel implementation
of linked-lists which is non-blocking, linearizable and which is based on the
compare-and-swap (CAS) operation found on contemporary processors.

Section 0 sketches a proof of correctness, describes the use of model-checking 
to perform exhaustive verification within a limited application domain and also 
describes empirical tests performed on execution traces from an actual imple- 
mentation. 

In Sect. 0 we compare the performance of the new algorithm against that 
of a lock-based implementation and against an existing non-blocking algorithm. 
Compared with these other thread-safe algorithms, ours provides the best perfor- 
mance on each of three simulated workloads and for every level of concurrency. 



2 Overview 

In this section we present an overview of our algorithm and the difficulty in
implementing non-blocking linked-lists. As a running example consider an ordered
list containing the integers 10 and 30 along with sentinel head and tail nodes:

[Figure: the list Head -> 10 -> 30 -> Tail]

Such a data structure may comprise cells containing two fields: a key field
used to store the element and a next field to contain a reference to the next cell
in the list.

Insertion is straightforward: a new list cell is created (below, left) and then
introduced using a single CAS operation on the next field of the proposed
predecessor (below, right).







[Figure: a new cell containing 20 is created (left) and then linked between 10 and 30 (right)]







In this case the atomicity of the CAS ensures that the nodes on either side of
the insertion have remained adjacent. This simple guarantee is insufficient for 
deletions within the list. Suppose that we wish to remove the value 10. An 
obvious way of excising this node would be to perform a CAS that swings the 
reference from the head so that the node containing 30 becomes the first in the 
list: 




[Figure: the head's next field now refers to the node containing 30]



Although this CAS ensures that the node 10 was still at the start of the 
list it cannot ensure that no additional nodes were introduced between the 10 
node and the 30 node. If this deletion took place concurrently with the previous 
insertion then that new node would be lost: 







The single CAS could neither detect nor prevent changes between 10 and 
30 once the deletion procedure had selected 30. Our proposed solution - and 
indeed the crux of the algorithms presented here - is to use two separate CAS 
operations in place of that single one. The first of these is used to mark the next 
field of the deleted node in some way (below, left), whereas the second is used to 
excise the node (below, right): 



[Figure: the next field of node 10 is marked (left); node 10 is then excised from the list (right)]

We say that a node is logically deleted after the first stage and that it is 
physically deleted after the second. A marked field may still be traversed but 
takes a numerically distinct value from its previous unmarked state; the structure 
of the list is retained while signalling concurrent insertions to avoid introducing 
new nodes immediately after those that are logically deleted. In our example the 
concurrent insertion of 20 would observe 10 to be logically deleted and would 
attempt to physically delete it before re-trying the insertion. 
3 Related Work

Generalized non-blocking implementations based on CAS were presented by
Herlihy [Her91, Her93]. However, linked-lists based on this general scheme are highly
centralized and suffer poor performance because they essentially use CAS to
change a shared global pointer from one version of the structure to the next.

Valois was the first to present an effective CAS-based non-blocking
implementation of linked-lists. Although highly distributed, his implementation is
very involved. The list is held with auxiliary cells between adjacent pairs of
ordinary cells. Auxiliary cells exist to provide an extra level of indirection so
that a cell may be removed by joining together the auxiliary cells adjacent to it.
Valois’ algorithm exposes a more general and lower-level interface than we do
here; he provides explicit cursors to identify cells in the list and operations to
insert or delete nodes at those points.

The originally-published algorithm contained a number of errors relating to
how reference-counted storage was managed. One has been reported previously
and others were identified when implementing Valois’ algorithm for comparison
in this paper [MS95, Val01].

To overcome the complexity of building linearizable lock-free linked-lists
using CAS, Greenwald suggested a stronger double-compare-and-swap (DCAS)
primitive that atomically updates two storage locations after confirming that
they both contain required values. DCAS is not available on today’s
multi-processor architectures. However, it does admit a simple linearizable
linked-list algorithm: insertions proceed as described in Sect. 2 and deletions
by atomic updates to the next field of the cell being removed as well as that
of its predecessor. Greenwald’s work was an extension of earlier non-linearizable
DCAS-based linked-list algorithms due to Massalin and Pu [MP91].

4 Algorithms 

In this section we present our new algorithm in pseudo-code modeled on C++
and designed for execution on a conventional shared-memory multi-processor
system supporting read, write and atomic compare-and-swap operations. We
assume that the operations defined here are the only means of accessing linked
list objects. Each processor executes a sequence of these operations, defining a
history of invocations/responses and inducing a real-time order between them.
We say that an operation A precedes B if the response to A occurs before the
invocation of B and that operations are concurrent if they have no real-time
ordering.

A sequential history is one in which each invocation is followed immediately
by its corresponding response. Our basic correctness requirement is
linearizability which requires that (a) the responses received in every concurrent
history are equivalent to those of some legal sequential history of the same requests
and (b) the ordering of operations within the sequential history is consistent
with the real-time order [HW90]. Linearizability means that operations appear
class List<KeyType> {
  Node<KeyType> *head;
  Node<KeyType> *tail;

  List() {
    head = new Node<KeyType> ();
    tail = new Node<KeyType> ();
    head.next = tail;
  }
}

class Node<KeyType> {
  KeyType key;
  Node *next;

  Node (KeyType key) {
    this.key = key;
  }
}




Fig. 1. An instance of the List class contains two fields which identify the head and 
the tail. Instances of Node contain two fields identifying the key and successor of the 
node. 

public boolean List::insert (KeyType key) {
  Node *new_node = new Node (key);
  Node *right_node, *left_node;

  do {
    right_node = search (key, &left_node);
    if ((right_node != tail) && (right_node.key == key)) /*T1*/
      return false;
    new_node.next = right_node;
    if (CAS (&(left_node.next), right_node, new_node)) /*C2*/
      return true;
  } while (true); /*B3*/
}

Fig. 2. The List::insert method attempts to insert a new node with the supplied
key.

to take effect atomically at some point between their invocation and response. 
Our implementation is additionally non-blocking, meaning that some operation 
will complete in a finite number of steps, even if other operations halt. 

We write CAS(addr,o,n) for a CAS operation that atomically compares the 
contents of addr against the old value o and - if they match - writes n to that 
location. CAS returns a boolean indicating whether this update took place. Our 
design was guided by the assumption that a CAS operation is slower to execute 
than a write which in turn is slower than a read. 
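
The CAS(addr,o,n) notation can be illustrated with a minimal C++ sketch. It assumes C++11 `std::atomic`; the helper name `CAS` and its template form are illustrative choices made here rather than part of the pseudo-code, and the default sequentially consistent ordering is used for simplicity.

```cpp
#include <atomic>

// Hypothetical helper mirroring the CAS(addr, o, n) notation used in the text.
// Returns true if *addr held old_value and was atomically replaced by new_value.
template <typename T>
bool CAS(std::atomic<T>* addr, T old_value, T new_value) {
    // compare_exchange_strong overwrites its first argument on failure;
    // old_value is a by-value copy, so the caller's variable is not affected.
    return addr->compare_exchange_strong(old_value, new_value);
}
```

A failed call leaves the location unchanged, which is what allows the retry loops in the figures to re-read the list and try again.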

4.1 Implementing Sets 

Initially we will consider a set object supporting three operations: Insert(k),
Delete(k), Find(k). Each parameter k is drawn from a set of totally-ordered
keys. The result of an Insert, a Delete or a Find is a boolean indicating success
or failure. The set is represented by an instance of List which contains a
singly-linked list of instances of Node. As sketched in Sect. 2 these are held in
ascending order with sentinel head and tail nodes.



public boolean List::delete (KeyType search_key) {
  Node *right_node, *right_node_next, *left_node;

  do {
    right_node = search (search_key, &left_node);
    if ((right_node == tail) || (right_node.key != search_key)) /*T1*/
      return false;
    right_node_next = right_node.next;
    if (!is_marked_reference(right_node_next))
      if (CAS (&(right_node.next), /*C3*/
               right_node_next, get_marked_reference (right_node_next)))
        break;
  } while (true); /*B4*/

  if (!CAS (&(left_node.next), right_node, right_node_next)) /*C4*/
    right_node = search (right_node.key, &left_node);
  return true;
}

Fig. 3. The List::delete method attempts to remove a node containing the supplied
key.

public boolean List::find (KeyType search_key) {
  Node *right_node, *left_node;

  right_node = search (search_key, &left_node);
  if ((right_node == tail) ||
      (right_node.key != search_key))
    return false;
  else
    return true;
}

Fig. 4. The List::find method tests whether the list contains a node with
the supplied key.



The reference contained in the next field of a node may be in one of two
states: marked or unmarked. A node is marked if and only if its next field is
marked. Marked references are distinct from normal references but still allow
the referred-to node to be determined - for example they may be indicated by
an otherwise-unused low-order bit in each reference. Intuitively a marked node
is one which should be ignored because some process is deleting it. The function
is_marked_reference(r) returns true if and only if r is a marked reference.
Similarly get_marked_reference(r) and get_unmarked_reference(r) convert
between marked and unmarked references.
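
One possible realisation of these three functions, using the low-order bit suggested above, is sketched here in C++. It assumes that nodes are aligned to at least two bytes so that bit 0 of a valid node address is always clear; the `Node` layout is illustrative, and the pointer-to-integer round trip relies on implementation-defined behaviour that holds on common platforms.

```cpp
#include <cstdint>

// Illustrative node type; only its alignment matters for the tagging trick.
struct Node {
    int key;
    Node* next;
};

static_assert(alignof(Node) >= 2, "bit 0 of a Node address must be unused");

// True if and only if the low-order bit of the reference is set.
inline bool is_marked_reference(Node* r) {
    return (reinterpret_cast<std::uintptr_t>(r) & 1u) != 0;
}

// Same address with the mark bit set.
inline Node* get_marked_reference(Node* r) {
    return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(r) | 1u);
}

// Same address with the mark bit cleared; safe to dereference.
inline Node* get_unmarked_reference(Node* r) {
    return reinterpret_cast<Node*>(
        reinterpret_cast<std::uintptr_t>(r) & ~std::uintptr_t{1});
}
```

Because a marked reference differs numerically from its unmarked form, a CAS that expects the unmarked value fails once the node has been logically deleted.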

The concurrent implementation comprises four methods (Figs. 2-5). The first
three, List::insert, List::delete and List::find, implement the Insert,
Delete and Find operations respectively. The fourth, List::search, is used during
each of these operations. It takes a search key and returns references to two
nodes called the left node and right node for that key. The method ensures that
these nodes satisfy a number of conditions. Firstly, the key of the left node must
be less than the search key and the key of the right node must be greater than
private Node *List::search (KeyType search_key, Node **left_node) {
  Node *left_node_next, *right_node;

search_again:
  do {
    Node *t = head;
    Node *t_next = head.next;

    /* 1: Find left_node and right_node */
    do {
      if (!is_marked_reference(t_next)) {
        (*left_node) = t;
        left_node_next = t_next;
      }
      t = get_unmarked_reference(t_next);
      if (t == tail) break;
      t_next = t.next;
    } while (is_marked_reference(t_next) || (t.key < search_key)); /*B1*/
    right_node = t;

    /* 2: Check nodes are adjacent */
    if (left_node_next == right_node)
      if ((right_node != tail) && is_marked_reference(right_node.next))
        goto search_again; /*G1*/
      else
        return right_node; /*R1*/

    /* 3: Remove one or more marked nodes */
    if (CAS (&((*left_node).next), left_node_next, right_node)) /*C1*/
      if ((right_node != tail) && is_marked_reference(right_node.next))
        goto search_again; /*G2*/
      else
        return right_node; /*R2*/

  } while (true); /*B2*/
}

Fig. 5. The List::search operation finds the left and right nodes for a particular
search key.



or equal to the search key. Secondly, both nodes must be unmarked. Finally, the
right node must be the immediate successor of the left node. This last condition
requires the search operation to remove marked nodes from the list so that the
left and right nodes are adjacent. As we will show, the List::search method
is implemented so that these conditions are satisfied simultaneously at some point
between the method’s invocation and its completion.

List::search is divided into three sections. The first section iterates along
the list to find the first unmarked node with a key greater than or equal to
the search key. This is the right node. The left node preliminarily refers to the
previous unmarked node that was found. The second section examines these nodes.
If left_node is the immediate predecessor of right_node then List::search
returns. Otherwise, the third section uses a CAS operation to remove marked
nodes between left_node and right_node.
List::insert uses List::search to locate the pair of nodes between which
the new node is to be inserted. The update itself takes place with a single CAS
operation (C2) which swings the reference in left_node.next from right_node
to the new node.

List::delete uses List::search to locate the node to delete and then uses
a two-stage process to perform the deletion. Firstly, the node is logically deleted
by marking the reference contained in right_node.next (C3). Secondly, the node
is physically deleted. This may be performed directly (C4) or within a separate
invocation of search.

The List::find method is shown in Fig. 4. It invokes List::search and
examines the resulting right node.
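
The four methods can be transcribed into compilable C++ roughly as follows. This is a sketch under stated assumptions rather than the implementation evaluated in this paper: keys are plain `int`, Delete is named `remove` because `delete` is a C++ keyword, all atomic accesses use the default sequentially consistent ordering, and excised nodes are never reclaimed (no reference counting or deferred freeing).

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    int key;
    std::atomic<Node*> next{nullptr};
    explicit Node(int k) : key(k) {}
};

// Mark bit kept in the low-order bit of the reference (assumes aligned nodes).
inline bool is_marked(Node* p) { return (reinterpret_cast<std::uintptr_t>(p) & 1u) != 0; }
inline Node* marked(Node* p) { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) | 1u); }
inline Node* unmarked(Node* p) { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{1}); }

class List {
public:
    List() : head_(new Node(0)), tail_(new Node(0)) { head_->next.store(tail_); }

    bool insert(int key) {
        Node* new_node = new Node(key);
        for (;;) {
            Node* left;
            Node* right = search(key, &left);
            if (right != tail_ && right->key == key) { delete new_node; return false; }
            new_node->next.store(right);
            if (left->next.compare_exchange_strong(right, new_node)) return true;  // C2
        }
    }

    bool remove(int key) {
        Node* left; Node* right; Node* right_next;
        for (;;) {
            right = search(key, &left);
            if (right == tail_ || right->key != key) return false;
            right_next = right->next.load();
            if (!is_marked(right_next) &&
                right->next.compare_exchange_strong(right_next, marked(right_next)))  // C3
                break;  // logically deleted
        }
        Node* expected = right;
        if (!left->next.compare_exchange_strong(expected, right_next))  // C4
            search(key, &left);  // let search perform the physical deletion
        return true;
    }

    bool find(int key) {
        Node* left;
        Node* right = search(key, &left);
        return right != tail_ && right->key == key;
    }

private:
    Node* search(int key, Node** left_out) {
        for (;;) {
            Node* left = head_;
            Node* left_next = nullptr;
            Node* t = head_;
            Node* t_next = head_->next.load();
            do {  // 1: find the left and right nodes
                if (!is_marked(t_next)) { left = t; left_next = t_next; }
                t = unmarked(t_next);
                if (t == tail_) break;
                t_next = t->next.load();
            } while (is_marked(t_next) || t->key < key);
            Node* right = t;
            if (left_next == right) {  // 2: nodes are already adjacent
                if (right != tail_ && is_marked(right->next.load())) continue;
                *left_out = left;
                return right;
            }
            if (left->next.compare_exchange_strong(left_next, right)) {  // 3: C1
                if (right != tail_ && is_marked(right->next.load())) continue;
                *left_out = left;
                return right;
            }
        }
    }

    Node* head_;
    Node* tail_;
};
```

The sketch leaks every excised node, so it is suitable only for experimentation; Sect. 6 discusses reclamation strategies.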



5 Correctness 

In this section we describe three approaches taken to checking the correctness
of the algorithms presented here. Sect. 5.1 outlines a proof of linearizability
and progress. Sect. 5.2 describes the exhaustive testing of some cases through
model checking and Sect. 5.3 describes a method we used for examining traces
from particular program runs.



5.1 Proof Sketch 

We will take a fairly direct approach to outlining the linearizability of the op- 
erations by identifying particular instants during their execution at which the 
complete operation appears to occur atomically. 



Conditions Maintained by Search. Our argument relies on the conditions
identified in Sect. 4.1 which the implementation of List::search guarantees
hold at some point during its invocation. For the ordering constraints, note
that when right_node is initialized the preceding loop ensured that search_key
<= right_node.key. Similarly left_node.key < search_key because otherwise
the loop would have terminated earlier.

For the adjacency condition and the mark state of the left node we must
separately consider each return path. If List::search returns at R1 then the
test guarding the return statement ensures that right_node was the immediate
successor of the left node when the next field of that node was read into the
local variable t_next. The same value of t_next is found to be unmarked before
initializing left_node. If List::search returns at R2 then C1 establishes the
required conditions.

For the mark state of the right node, observe that both return paths confirm 
that the right node is unmarked after the point at which the first three conditions 
must be true. Nodes never become unmarked and so we may deduce that the 
right node was unmarked at that earlier point. 
Linearization points. Let $op_{i,m}$ be the $m$th operation performed by processor
$i$ and let $d_{i,m}$ be the final real-time at which the List::search post-conditions
are satisfied during its execution. These $d_{i,m}$ identify the times at which the
outcomes of the operations become inevitable and we shall take the ordering
between them to define the linearized order of Find(k) operations or unsuccessful
updates. For a successful find at $d_{i,m}$ the right node was unmarked and contained
the search key. For an unsuccessful insertion it exhibits a node with a matching
key. For an unsuccessful deletion or find it exhibits the left and right nodes which,
respectively, have keys strictly less-than and strictly greater-than the search key.

Furthermore let $u_{i,m}$ be the real-time at which the update C2 inserts a node or
C3 logically deletes a node. We shall take $u_{i,m}$ as the linearization points for such
successful updates. In the case of a successful insertion the CAS at $u_{i,m}$ ensures
that the left node is still unmarked and that the right node is still its successor.
For a successful deletion the CAS at $u_{i,m}$ serves two purposes. Firstly, it ensures
that the right node is still unmarked immediately before the update (that is, it
has not been logically deleted by a preceding successful deletion). Secondly, the
update itself marks the right node and therefore makes the deletion visible to
other processors.



Progress. We will show that the concurrent implementation is non-blocking. 
We will show that each successful insertion causes exactly one update, that each 
successful deletion causes at most two updates and that unsuccessful operations 
do not cause any updates. 

The CAS instructions C1 and C4 each succeed only by unlinking marked nodes
from the list. Therefore the number of times that these CAS instructions succeed
is bounded above by the number of nodes that have been marked. Exactly one
node is marked during each successful deletion (C3) and therefore at most one
update may be performed by C1 or C4 for each successful deletion. The remaining
CAS instructions (C2 and C3) occur respectively exactly once on the return paths
from successful insertions and deletions.

Since there are no recursive or mutually-recursive method definitions, we
consider each backward branch in turn:

— Each time B1 is taken the local variable t is advanced one node down the
list. The list always contains the unmarked tail node and the nodes visited
have successively strictly larger keys.

— Each time B2 is taken the CAS at C1 has failed and therefore
left_node.next != left_node_next. The value of the field must have been
modified since it was read during the loop ending at B1. Modifications are
only made by successful CAS instructions and each operation causes at most
two successful CAS instructions.

— Each time B3 or B4 is taken the CAS at C2 or C3 has failed. As before, the
value held in that location must have been modified since it was read in
List::search and at most two such updates may occur for each operation.

— Each time G1 or G2 is taken a node which was previously unmarked
has been marked by another processor. As before, at most two updates may
occur for each operation.

5.2 Model Checking 

The dSPIN model checker was used to exhaustively verify the operations for
certain problem domains. dSPIN is an extension of the SPIN model checker
which adds support for pointers, storage management, function calls and local
scopes [Hol97, IS99]. This made it more suitable than SPIN for a natural
representation of these algorithms.

The modeled state contains two representations of the set: one comprises a 
linked list of cells whereas the other is summarized as a bit vector. The linked 
list is updated as proposed here using atomic d_step instructions to implement 
CAS. The bit vector is checked or updated using further d_steps at the proposed 
linearization points. 

The model was parameterized according to the number of concurrent threads, 
the number of operations that each would attempt and the range of key values 
that could be used. The two largest configurations we could practicably test were 
with four threads, each performing one operation with three potential keys and 
with two threads each performing two operations with four potential keys. 

5.3 Practical Testing 

The linearizability of the operations has also been tested pragmatically. Although
such tests cannot provide the assurances of formal methods they are nonetheless
important because they avoid the need to make simplifying assumptions
for tractability. In particular, the use of relaxed memory models means that
the operations supported by a conventional shared memory machine are not
linearizable; a direct implementation is likely to fail without further memory
barrier instructions.

It is not generally possible to record actual timestamp values for arbitrary
operations within a running process. Instead, we surrounded the code executed
at each linearization point with further instructions to record coherent
per-processor cycle counts.

The resulting intervals were recorded to an in-memory log which was then 
replayed sequentially in timestamp order. The results thus obtained were com- 
pared with those from the concurrent execution. The replay program contains 
simple heuristics to deal with overlapping intervals. If these cannot determine a 
consistent linearized order then the replay program reports unresolved inconsis- 
tencies for manual inspection and re-ordering. 
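
The replay step can be sketched as follows. This is an illustrative simplification, not the replay program used here: the record layout and all names are hypothetical, records are ordered by the start of their interval only, and the heuristics for overlapping intervals mentioned above are omitted. A `std::set` stands in for the sequential specification of the set object.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

enum class Op { Insert, Delete, Find };

// One logged operation: the cycle-count interval recorded around its
// linearization point, plus the result seen in the concurrent run.
struct LogRecord {
    std::uint64_t start;
    std::uint64_t end;
    Op op;
    int key;
    bool result;
};

// Replays the log sequentially in timestamp order against a model set.
// Returns -1 if every logged result matches, otherwise the position (in
// sorted order) of the first record whose result differs.
long replay(std::vector<LogRecord> log) {
    std::sort(log.begin(), log.end(),
              [](const LogRecord& a, const LogRecord& b) { return a.start < b.start; });
    std::set<int> model;
    for (std::size_t i = 0; i < log.size(); ++i) {
        bool expected = false;
        switch (log[i].op) {
            case Op::Insert: expected = model.insert(log[i].key).second; break;
            case Op::Delete: expected = model.erase(log[i].key) == 1; break;
            case Op::Find:   expected = model.count(log[i].key) == 1; break;
        }
        if (expected != log[i].result) return static_cast<long>(i);
    }
    return -1;
}
```

A mismatch reported by such a check is only a candidate inconsistency: when intervals overlap, another ordering of the affected records may still be legal.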

6 Results 

The algorithm described in Sect. 4 has been implemented in a combination of C
and SPARC V9 assembly language. We evaluated its performance on an E450
Fig. 6. CPU time (user+system) accounted to the benchmark application for each
algorithm on a variety of workloads. In each case the x-axis shows the number of
concurrent threads.



server running Solaris 8 and fitted with four 400MHz SPARC V9 processors
and 4GB physical memory. It is worth emphasising that the code in Figs. 1-5
is intended merely as pseudo-code and does not reflect an optimised (or even
necessarily correct) implementation. Processors may require additional memory
barriers - for example between initializing the fields of a new node and
introducing it into the list, or between the CAS that logically deletes a node and
the CAS that physically deletes it.

The test application compared our implementation against Valois’ lock-free
algorithm and against a straightforward one in which the list is protected by
a mutual exclusion lock.¹ Both lock-free algorithms were evaluated with and
without reference-counting. All list cells were allocated ahead of time so that the
performance of particular memory allocation functions was not included in the
results. The code to manipulate reference-counts is based on Valois’ as modified



¹ This comparison against a lock-based algorithm is somewhat unfair: the simplified
programming model there makes it straightforward to implement a more efficient
data structure such as a tree or skiplist.
by Michael and Scott with the exception that reference counts are recursively
decremented when a cell is freed.

We generated a workload of insertion and deletion operations by randomly
choosing keys uniformly distributed within a particular range, selecting
equiprobably between insertions and deletions. We used per-thread linear congruential
random number generators with the same parameters as the lrand48 function
from the Solaris 8 libc library. Seeds were chosen to give non-overlapping series.
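
A per-thread generator of this kind might look like the following C++ sketch. The multiplier 0x5DEECE66D, increment 0xB and 48-bit modulus are the standard parameters of the drand48 family to which lrand48 belongs; the class name `Lcg48` is hypothetical, and the constructor takes the raw 48-bit state rather than reproducing the seeding convention of srand48. How seeds were chosen to give non-overlapping series is not shown.

```cpp
#include <cstdint>

// Hypothetical per-thread generator: X(n+1) = (a * X(n) + c) mod 2^48.
class Lcg48 {
public:
    explicit Lcg48(std::uint64_t state48) : state_(state48 & kMask) {}

    // Advances the state and returns its high-order 31 bits, as lrand48 does.
    std::uint32_t next() {
        state_ = (kA * state_ + kC) & kMask;
        return static_cast<std::uint32_t>(state_ >> 17);
    }

    // Illustrative helper: a key in [0, range). The modulo reduction is
    // slightly biased unless range is a power of two.
    std::uint32_t next_key(std::uint32_t range) { return next() % range; }

private:
    static constexpr std::uint64_t kA = 0x5DEECE66DULL;
    static constexpr std::uint64_t kC = 0xBULL;
    static constexpr std::uint64_t kMask = (1ULL << 48) - 1;
    std::uint64_t state_;
};
```

Keeping one generator object per thread avoids any shared state, so key generation adds no synchronisation to the measured workload.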

The test harness was parameterized on the algorithm to use, the number
of concurrent threads to operate and the range of keys that might be inserted
or deleted. In each case every thread performed 1 000 000 operations. Figure 6
shows the CPU time accounted to the process as a whole for each of the algorithms
tested on a variety of workloads.

It is immediately apparent that our algorithm performs notably better for
every experiment using more than one thread. In the case of single-threaded
execution it outperforms Valois’ algorithm in these tests and its performance
equals that of the lock-based implementation. The relative performance compared
with Valois’ algorithm is not surprising: we avoid the need to create, traverse
and excise auxiliary nodes.

In addition to the workloads presented in those graphs we also tested con- 
figurations with larger ranges of keys, or where the list was initially ‘primed’ 
with a long sequence of nodes that would never be deleted. In each case this 
increased the total number of nodes in the list and thereby added to the cost of 
retrying operations when CAS instructions fail. One fear was that the lock-free 
algorithms would start to perform poorly because of the potential for multiple 
retries. We studied workloads up to lists of 65 536 elements and were unable to 
find any configuration for which the algorithms based on mutual-exclusion give 
the best performance. We suspect that although each retry becomes more costly, 
the likelihood of retries decreases as the rate of conflicting updates falls. 

Figure 6 shows the performance of reference-counted implementations. The
CPU requirements of Valois’ algorithm are degraded by a factor of 5 in the
single-threaded case, rising to over 11 for sixteen threads. Similarly, the CPU
time required by our algorithms is degraded by a factor of 10 rising to over 15.
In each case this is a consequence of the need to manipulate reference-counts
(using CAS operations) at each stage during a list’s traversal. Valois reports that
he had originally intended to assume the use of a tracing garbage collector.

The performance of reference-count manipulation is hampered because the
SPARC processor does not provide atomic fetch-and-add. However,
measurements taken on a dual-processor Intel x86 machine (with that facility [PPr96])
suggest that the degradation is low when compared with the overall costs seen
here. When the reference counts lie on separate lines in the L1 data cache then
updates implemented through CAS are 10% slower than those using
fetch-and-add. This rises to a factor of 2 degradation when the two processors attempt to
update the same address.
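
The two ways of updating a reference count compared above can be sketched in C++ as follows. The function names are illustrative, `std::atomic<int>` stands in for a count field, and whether `fetch_add` compiles to a single instruction depends on the target processor.

```cpp
#include <atomic>

// Increment built only from CAS, as required on a processor without
// fetch-and-add. Returns the value held before the increment.
int increment_with_cas(std::atomic<int>& count) {
    int old_value = count.load();
    // On failure compare_exchange_weak reloads old_value, so the loop retries
    // with the latest contents until the update succeeds.
    while (!count.compare_exchange_weak(old_value, old_value + 1)) {
    }
    return old_value;
}

// The same update using an atomic fetch-and-add.
int increment_with_fetch_add(std::atomic<int>& count) {
    return count.fetch_add(1);
}
```

The CAS version may retry under contention, which is consistent with the larger slowdown reported when both processors update the same address.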

Of course, our results are optimistic in that they do not consider the cost of
performing GC. However, as Jones and Lins write, if the size of the active data
structure is fixed then the cost of copying collectors may be reduced arbitrarily
at the expense of the total heap size. More practically, they report that
overall costs of around 10-20% are typical in modern well-implemented systems.

We examined a further approach to storage reclamation based on the deferred 
freeing of nodes. In this scheme each node contains an additional field through 
which it can be linked onto a to-be-freed list when it is excised from the main 
list. Each thread takes a snapshot of a global timer as its current time before 
starting each operation. Entries are removed from a to-be-freed list when the 
time of their excision precedes the minimum current time of any thread: at that 
point no thread can still have a reference to the node held in any of its local 
variables. 

Our implementation allocates a pair of to-be-freed lists for each thread. These
are termed the old list and the new list and are held along with a separate
per-thread timer snapshot that is more recent than the excision time of any element
of the old list. When the minimum current time exceeds the snapshot then the
entire contents of the old list are freed and the elements of the new list are moved
to the old list.
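
The old/new list rotation can be sketched as follows. This is a single-threaded illustration of the bookkeeping only, under several assumptions: the structure and member names are hypothetical, `std::vector` replaces the intrusive to-be-freed links, the caller supplies the minimum current time over all threads, and the CAS used to place nodes on a to-be-freed list is not modelled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    int key;
};

// Hypothetical per-thread state for deferred freeing.
struct DeferredFreeLists {
    std::vector<Node*> old_list;  // nodes excised no later than `snapshot`
    std::vector<Node*> new_list;  // nodes excised since the last rotation
    std::uint64_t snapshot = 0;   // no earlier than any excision time in old_list

    // Called when a node has been physically deleted from the main list.
    void retire(Node* n) { new_list.push_back(n); }

    // min_current_time: minimum, over all threads, of the timer value each
    // thread recorded when starting its current operation.
    // now: current value of the global timer.
    // Returns the number of nodes freed.
    std::size_t try_free(std::uint64_t min_current_time, std::uint64_t now) {
        if (min_current_time <= snapshot) return 0;  // a thread may still hold a reference
        std::size_t freed = old_list.size();
        for (Node* n : old_list) delete n;
        old_list.clear();
        old_list.swap(new_list);  // the new list becomes the old list
        snapshot = now;           // not older than anything just moved
        return freed;
    }
};
```

A node therefore survives at least one full rotation before it is freed, which is the price paid for not tracking an excision time per node.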

This deferred freeing scheme introduces two principal overheads when
compared with the use of garbage collection. Firstly, a CAS operation is needed
to place nodes on a to-be-freed list - in our implementation this increased the
CPU requirements by 15% compared with the results from Fig. 6 operating
with 16 threads. The second overhead is the cost of removing elements from the
to-be-freed lists and establishing when it is safe to do so. This was a further
1% when performed every 1000 operations and 5% every 100, rising to 52% if
performed after every operation.

Figure 7 presents a further analysis of the run-time performance of the three
non-reference-counted algorithms, showing the distribution of execution-times
for four different kinds of operation. These results were gathered with 8
concurrent threads performing insertions and deletions of keys in the range 0...255.
In the case of successful operations the lock-based implementation is able to
achieve lower execution times than either lock-free scheme. However, it is also
occasionally prone to much longer execution times which explain the higher mean
execution time suggested by Fig. 6.

The situation is somewhat different for unsuccessful operations in that both 
lock-free algorithms obtain some execution times which are lower than those of 
the lock-based implementation - recall that unsuccessful operations may occur 
without requiring any CAS operations or other updates to the data structure. 



7 Delete Greater-Than-or-Equal 

Now consider the problem of implementing a further operation of the form
DeleteGE(k) which returns and removes the smallest item that is greater than
or equal to k. It is tempting to implement this by modifying List::delete so
that the test T1 does not fail if the key of the right node is greater than the
search key.
Fig. 7. Operation-time distributions. Each graph shows execution times (in processor
cycles) on the x-axis and numbers of occurrences on the y-axis.



Unfortunately this implementation is not linearizable. Suppose that three
insertion operations are executed in sequence: Insert(20), Insert(15), Insert(20).
The first two succeed and the third must fail because 20 is already in the set.
However, consider a concurrent DeleteGE(10) operation, attempting to delete
any node with a key greater than or equal to 10. Concurrent execution may
proceed as follows:

— List::deleteGE invokes List::search immediately after the first insertion
of 20. It takes the head of the list as the left node and the node containing
20 as the right.

— The successful insertion of 15 occurs.

— The unsuccessful insertion of 20 occurs, observing 15 as the key of its left
node and 20 as the key of its right node.

— List::deleteGE(10) completes after logically deleting the node containing
20.

We must order this DeleteGE(10) operation such that its result of 20 would
be obtained by a sequential execution. This requires it to be placed before the
insertion of 15 because otherwise the key 15 should have been returned in
preference to 20. However, we must also linearize the deletion after the failed insertion
of 20 because otherwise that insertion would have succeeded. These constraints
are irreconcilable.
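
The conflict between these constraints can be checked mechanically. The C++ sketch below is an illustrative aid, not part of the algorithm: it tries every position for DeleteGE(10) within the fixed sequence Insert(20), Insert(15), Insert(20) and tests whether a sequential execution reproduces the observed results (true, true, false, and 20). All function names are hypothetical and `std::set` models the sequential set.

```cpp
#include <set>

// Sequential DeleteGE: removes the smallest key >= k and stores it in *out.
// Returns false if no such key exists.
bool delete_ge(std::set<int>& s, int k, int* out) {
    std::set<int>::iterator it = s.lower_bound(k);
    if (it == s.end()) return false;
    *out = *it;
    s.erase(it);
    return true;
}

// Places DeleteGE(10) after `slot` of the three inserts (slot in 0..3) and
// reports whether every operation returns what the concurrent run observed.
bool placement_matches(int slot) {
    const int keys[3] = {20, 15, 20};
    const bool observed[3] = {true, true, false};
    std::set<int> s;
    for (int i = 0; i <= 3; ++i) {
        if (i == slot) {
            int removed = 0;
            if (!delete_ge(s, 10, &removed) || removed != 20) return false;
        }
        if (i < 3) {
            bool inserted = s.insert(keys[i]).second;
            if (inserted != observed[i]) return false;
        }
    }
    return true;
}

// True if some sequential ordering explains the observed results.
bool exists_linearization() {
    for (int slot = 0; slot <= 3; ++slot)
        if (placement_matches(slot)) return true;
    return false;
}
```

Every placement fails: before the first insertion there is nothing to delete, directly after it the final Insert(20) would succeed, and any later placement returns 15 rather than 20.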

Intuitively the problem is that at the execution of C3 the right node need
not be the immediate successor of the left node. This was acceptable when
considering the basic Delete(k) operation because we were only concerned with
concurrent updates affecting nodes with the same key. Such an update must
have marked the right node and so C3 would have failed. In contrast, during the
execution of List::deleteGE, we must be concerned with updates to any nodes
whose keys are greater than or equal to the search key.

We can address this by retaining the implementation of List::deleteGE but
changing List::insert in such a way that C3 must fail whenever a new node
may have been inserted between the left and right nodes. This would mean that,
whenever C3 succeeds, the key of the right node must still be the smallest key
that is greater than or equal to the search key.

This is achieved by using a single CAS operation to (a) introduce a pair of new
nodes, one that contains the value being inserted and another that duplicates
the right node and (b) mark the original right node:



Such a CAS conceptually has two effects. Firstly, it introduces the new node
into the list: beforehand the next field of the successor is unmarked and
therefore the right node must still be the successor of the left node. Secondly, by
marking the contents of that next field, the CAS will cause any concurrent
List::deleteGE with the same right node to fail. Note that the key of the
now-marked right node is not in the correct order. However, the existing
implementations of List::search, List::delete, List::find and List::deleteGE
are written so that they do not rely on the correct ordering of marked nodes.

8 Conclusion 

This paper has presented a new non-blocking implementation of linked lists. We 
believe that the algorithms presented here are linearizable. They have also been 
implemented and we have shown that their measured performance improves both 
on previously published non-blocking data structures and also on a lock-based 
implementation. 
Abstract. This paper proposes a weakening of stabilization suited to
the model of objects updated and queried by operations (method
invocations). The paper’s main result is a construction for replicated search
trees in a message-passing, synchronous distributed system. Any
sequence of O(d) operations brings all trees to legitimate and consistent
states, where d is the maximum number of items accessible in the set of
search trees at the initial state.



1 Introduction 

The discovery of many fundamental concepts in distributed computing can be 
traced back to research motivated by distributed databases, and especially repli- 
cated data. Replication is useful for fault tolerance and also to enhance avail- 
ability. Most often, fault tolerance is taken to mean either masking the effects of 
failures by some form of redundancy or by restoring computation from backup 
or checkpoint states. In the former case, the tolerance depends on limitations 
of the failure with respect to the amount of redundancy; in the latter, toler- 
ance depends on stable storage unaffected by failures. Notice that the idea of 
restoring from a backup copy can interfere with availability because new service 
requests may have to wait until the restoration is complete. In contrast to these 
techniques of redundancy or backup/restore, the paradigm of (self-) stabilization 
is forward error recovery, which is to say that a system makes do with a faulty 
state, by applying repairs to data without recourse to backup copies. Further, for 
stabilization there is no limit on the number of transient failures, so redundancy 
is not a primary issue. Unfortunately, the model of stabilization does not address 
two vital availability concerns: first, the behavior of a stabilizing system can be 
erratic during periods of recovery from a transient failure, and second, there are 
few investigations of replicated objects in the literature of stabilizing systems. 

This paper initiates an effort to combine the concerns of availability with the 
forward recovery paradigm of stabilization in a distributed system. We chose the 
problem of a replicated search tree as our case study because a search tree is sim- 
ilar to many indexing structures used in the important application of distributed 
databases (in fact, our construction uses a 2-3 tree, which is a simple case of a 
B-tree). The significant properties of the algorithm presented in this paper are: 
a search tree with find, insert, and delete operations supports t identical copies 
of the data structure (one per site); find operations are local to a site to enhance 
availability; starting from data structures with arbitrary states (replicas that are 
not identical, data structures with damaged variables), recovery to a legitimate 
state with identical replicas is guaranteed; and operations can be applied to the 
search tree even while the global configuration is illegitimate, yet the running 
time of such operations is reasonable and the response to such operations is guar- 
anteed to be consistent with the operation’s effect at all sites. The construction 
is fully symmetric (not relying on a master copy for the replicas) and thus has 
other failure tolerances not described in this extended abstract. 

Related Work. Literature on replicated data, system availability, and fault- 
tolerant data structures is vast; we cite mainly here the relation of the present 
research to previous work that is self-stabilizing. Until recently, studies of sta- 
bilization did not attend to behavioral constraints during periods of recovery 
from a fault event (papers [4ISf~l are the exception). Several papers jl 1 2 \ 
have described methods for limiting the exposure of erroneous system output by 
adaptively decreasing the length of the period of convergence following a fault. 
In fact, some of these papers use data replication as a technique, but this data 
replication is not motivated directly by availability concerns. Nearly all published 
works in the area use a network model of nodes and links, and the computa- 
tional task is self-contained (interaction with agents outside the system is seldom 
mentioned). Few papers j9l 11)11 fj consider an object-centric model, in which a 
defined set of operations manipulate objects. Papers using an object model are 
concerned not only with the internal state of the object (which should eventually 
become legitimate from any initial state), but also with how operations behave 
and respond during times of instability. 

Organization. After presenting the model in Section 2 and reviewing stabilization 
and availability concepts for an individual site in Section 3, we provide a 
definition of what is an available and stabilizing replicated search tree in Section 
4.1. An overview of the construction is given in Section 4.2, with details appearing 
in following sections. We use atomic broadcast to ensure replica coherence, and 
a secondary contribution of this paper is the introduction of a stabilizing atomic 
broadcast in Section 4.3. Due to a lack of space in this extended abstract, proofs 
of correctness have been omitted. The final section concludes the paper. 



2 Distributed System Model 

We adopt a model of a distributed system based on I/O automata where 
sites, channels and a distributed system itself are modeled as I/O automata. This 
extended abstract omits details of the model because of lack of space. We consider 
a synchronous distributed message-passing system consisting of t sites. Each site 
is connected to any other site by a FIFO message channel (one channel for each 
direction). Sites and FIFO channels are reliable: each site correctly executes its 
program and each channel correctly transfers messages in the order they are 
sent. 

The distributed system considered in this paper is synchronous because 
we make the following two assumptions: (A1) the message delay between any two 
sites is bounded by some upper bound $\delta$, and $\delta$ is known to every site; 
(A2) each site has a timer keeping perfect time and can set the timer to raise 
an alarm after a preset time interval. 

A configuration of a distributed system consists of the states of all sites and 
channels. An execution of a distributed system is a (possibly infinite) alternating 
sequence of configurations and events $E = \sigma_0, e_1, \sigma_1, e_2, \sigma_2, \ldots$ such that the 
occurrence of event $e_i$ changes the configuration from $\sigma_{i-1}$ to $\sigma_i$. Notice that an 
event sending or receiving a message changes the states of a site and of a channel 
connected to the site. 

Since we consider a synchronous distributed system, we sometimes deal with 
a timed execution $\sigma_0, (e_1, t_1), \sigma_1, (e_2, t_2), \sigma_2, \ldots$ where each event $e_i$ is associated 
with the time $t_i$ when the event occurs. The timed execution we consider satisfies 
the following four properties: (i) the times assigned to events are non-decreasing, 
that is, $t_{i-1} \le t_i$ holds for any $i$; (ii) if $(e, t)$ is an event sending a message $M$ 
to a site $i$, there exists an event $(e', t')$ receiving $M$ at site $i$ such that $t \le t' \le t + \delta$ 
(this implies that the message delay has an upper bound $\delta$); (iii) if $(e, t)$ is $i$'s 
event setting its timer to $\tau$, then there exists $i$'s alarm event $(e', t')$ such that 
$t' = t + \tau$ unless $i$ resets the timer before $t'$; (iv) if an internal or send event $e$ 
is applicable at $\sigma_k$, then there exists $(e, t)$ such that $t = t_k$, where $t_k$ is the time 
assigned to the event that changes the configuration to $\sigma_k$ (this implies that we 
ignore time for internal and send events, that is, several internal and send events 
can be executed in an instant). 
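
As a hedged illustration, properties (i) and (ii) can be checked mechanically on a finite trace. The event encoding (tuples of kind, time and message identifier), the function name, and the simplification that each message has a single receiver are assumptions made for this sketch; they are not part of the model above. The trace is also assumed to extend at least $\delta$ past the last send event.

```python
from typing import List, Tuple

# Hypothetical encoding of a timed event: (kind, time, msg_id),
# where kind is "send" or "recv" and each message has one receiver.
Event = Tuple[str, float, int]


def check_timed_execution(events: List[Event], delta: float) -> bool:
    """Check properties (i) and (ii) on a finite trace.

    (i)  event times are non-decreasing;
    (ii) every sent message is received no earlier than it was sent
         and at most delta later.
    """
    times = [t for _, t, _ in events]
    if any(a > b for a, b in zip(times, times[1:])):
        return False  # violates (i)
    sent = {m: t for kind, t, m in events if kind == "send"}
    received = {m: t for kind, t, m in events if kind == "recv"}
    for m, t_send in sent.items():
        t_recv = received.get(m)
        if t_recv is None or not (t_send <= t_recv <= t_send + delta):
            return False  # violates (ii)
    return True
```

Properties (iii) and (iv) concern timers and instantaneous internal events and are not covered by this sketch.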

Since a stabilizing replicated search tree starts with an arbitrary initial 
configuration, the initial configuration may contain in-transit messages in some FIFO 
channels. We assume that such messages are received within time $\delta$ from the initial 
configuration. In other words, the following holds: if $(e, t)$ is $i$'s event receiving 
a message $M$ at time $t$ ($> \delta$), then there exists an event $(e', t')$ sending $M$ to $i$ such 
that $t' \le t$. 

3 Site Availability and Stabilization 

Each site is the host of one replica of the search tree. Three operations, insert, 
delete, and find, are defined on the search tree at each site. An insert(x) 
operation inserts the item (with key) x and has two possible responses: ok 
(indicating that x was successfully inserted or already present) or full (the insert 
was aborted because the tree is full). A delete(x) operation deletes the item x 
from the tree (if it exists) and always responds ok. A find(x) operation either 
returns the item x in the tree, empty (indicating that the tree is empty) or miss 
(the tree is not empty but does not contain x). 
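
The interface can be sketched as follows. This is only a stand-in under stated assumptions: the class name `SiteTree` is hypothetical, a Python set replaces the 2-3 tree, and full is returned only when the capacity is reached, whereas the stabilizing tree discussed below may respond full earlier.

```python
class SiteTree:
    """Stand-in for one replica; a Python set replaces the 2-3 tree."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # M, the maximum capacity of the tree
        self.items = set()

    def insert(self, x) -> str:
        if x in self.items:
            return "ok"  # already present
        if len(self.items) >= self.capacity:
            return "full"  # insert aborted because the tree is full
        self.items.add(x)
        return "ok"

    def delete(self, x) -> str:
        self.items.discard(x)  # no effect if x is absent
        return "ok"  # delete always responds ok

    def find(self, x):
        if not self.items:
            return "empty"
        return x if x in self.items else "miss"
```

For example, with capacity 2, two inserts succeed, a third insert of a new item responds full, and find on a missing key returns miss.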

Because initial configurations can be arbitrary, the internal variables of data 
structures within a site can be corrupt in the initial state. The implementation 
of a search tree presented in m satisfies the following conditions even when 
started from an arbitrary initial state. (B1) [Stabilization] After some number 
of operations, a point $z$ in the history of operations occurs, after which the 
tree behaves as a usual 2-3 tree: the running time of any operation is $O(\lg Z)$, 
where $Z$ is the size of the tree at the point where the operation runs, and 
an insert(x) operation definitely responds full if $Z = M$, and may respond 
full if $2M/3 < Z < M$, where $M$ is the maximum capacity of the tree. (The 
implementation in mu guarantees reaching the point $z$ of (B1) within $O(d)$ tree 
operations, where $d$ is the number of nodes reachable from the root in the 2-3 tree 
at the initial state.) (B2) [Availability] Even during convergence to a legitimate 
state, a successful insert(x) guarantees that the tree contains x until some 
delete(x) is applied; if a find(x) operation returns x even during convergence 
to a legitimate state, then subsequent find(x) operations do not return miss (or 
empty) until some delete(x) is applied. The complete definition of availability 
in cm specifies that the data structure offers some reliability and responsiveness 
guarantees at all points of any operation history: in particular, before the point $z$ 
is reached, the running time of any operation is $O(\lg M)$, though any insert(x) 
operation may respond full even if $Z < 2M/3$. 

Although we seldom refer to the availability and stabilization properties of 
each replica in this extended abstract, they are crucial to our distributed imple- 
mentation: so long as sufficiently many operations are applied to a replica, that 
replica’s state converges to a legitimate state. 

4 Distributed Stabilization 

4.1 Stabilizing Replicated Tree 

One important motivation for replicated search trees is to increase availability 
to applications by reducing the latency of find operations. Another important 
motivation is to preserve global consistency. It is well known from studies 
of cache coherency that these motivations can conflict. Our choice will be to let 
find operations execute only on local copies of search trees. This choice violates 
some consistency criteria: a perfectly consistent implementation would serialize 
find operations along with insert and delete operations, but to do so would 
not allow the find implementation to be entirely local. 

Let $S_i$ ($1 \le i \le t$) be a replica of the search tree at site $i$ (there are $t$ sites 
hosting replicas). Every operation on the replicated search tree has an originating 
site. A history of operations on the replicated search tree is a sequence of 
operations and their responses. Because operations on the replicated search tree 
are specified in terms of the semantics of site operations, we denote operations 
on the replicated search tree by R_find, R_insert and R_delete, and denote 
operations on replica $S_i$ by insert_i, delete_i and find_i. In the definition based 
on I/O automata, R_find, R_insert and R_delete are input events from the 
outside world, while insert_i, delete_i and find_i are internal events at $S_i$. Each 
site also has an output event respond_i to return the response to the outside 
world. The input events R_find, R_insert and R_delete are not under the sites' 
control and, thus, they may occur at any site and at any time. However, we 
assume that each site can receive a new tree operation from the outside world 
only after it returns the response to the previous tree operation originating at 
the site. 

To define the contents of a replicated search tree, we would like to assume 
that R_insert and R_delete operations are linearized in any history (the atomic 
broadcast defined later enables this assumption). For any point $\sigma$ in a linearized 
history of operations on a replicated search tree, the contents of the replicated 
search tree is the set of items $C_\sigma$ defined by: $x \in C_\sigma$ iff there is a point $\sigma'$ 
previous to $\sigma$ where a successful R_insert(x) operation is applied at $\sigma'$ (i.e., the 
operation responded ok) and there is no R_delete(x) operation applied between 
$\sigma'$ and $\sigma$; also, the contents of replica $S_i$ ($1 \le i \le t$) at the linearized point $\sigma$ 
is defined by a history that replaces each R_insert(x) by insert_i(x) and each 
R_delete(x) by delete_i(x). Thus all replicas have equal contents at each point 
in the linearized history. Any history of operations on the replicated search tree 
satisfies: (C1) An R_insert(x) operation may respond full only if the number 
$Z$ of items contained in the replicated search tree satisfies $K_i < Z < M_i$ at some 
replica $S_i$, where values $K_i$ and $M_i$ are specific to the sequential implementation 
of $S_i$. An R_insert(x) operation definitely responds full if $Z = M_i$ at some 
replica $S_i$. (C2) An R_find(x) operation originating at site $i$ returns empty if 
replica $S_i$ contains no items. Otherwise, the R_find(x) operation returns x if x is 
in $S_i$, and returns miss if x is not in $S_i$. The above properties imply that the 
initial content of the replicated search tree and every replica is the empty set. In 
the following, we assume for simplicity that all sites have identical $K_i$ and $M_i$, 
denoted by $K$ and $M$ respectively. 
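
The definition of the contents can be read as a fold over the linearized history. The following minimal sketch assumes a hypothetical encoding of a history as a list of (operation, key, response) triples; the function name is likewise an assumption.

```python
def contents(history):
    """Contents of the replicated tree after a linearized history.

    history: list of (op, x, response) triples, where op is
    "R_insert" or "R_delete" (hypothetical encoding).
    """
    c = set()
    for op, x, response in history:
        if op == "R_insert" and response == "ok":
            c.add(x)  # a successful insert puts x into the contents
        elif op == "R_delete":
            c.discard(x)  # any later delete removes it again
    return c
```

An item is in the result exactly when some successful R_insert of it is not followed by an R_delete of it, which matches the definition above; an insert that responded full contributes nothing.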

In a stabilizing replicated search tree, we make no assumption on its initial 
configuration; for example, the contents of one replica may differ from that 
of another. In this case, R_find(x) operations originating at different sites but 
for the same key x may respond differently, since the R_find operation is an 
entirely local operation. However, a stabilizing replicated search tree is required 
to have identical contents in all replicas eventually; thus, some items should be 
inserted and/or deleted at some replicas in addition to operations invoked by the 
outside world. To specify operation behavior for stabilization, we therefore use an 
augmented linearized history of operations. Let $\mathcal{H}$ be any history of operations 
on a replicated search tree. Let $\mathcal{H}_i$ be the projection of $\mathcal{H}$ for site $i$, that is, 
replace all R_insert operations by insert_i operations, all R_delete operations 
by delete_i operations, and remove all R_find operations not originating at site 
$i$. A replicated search tree is stabilizing if there exists, for each replica $S_i$ ($1 \le i \le t$), 
a sequence $P_i$ of at most $3M$ insert_i and delete_i operations such that $\mathcal{H}'_i$ 
satisfying the following conditions (C3) and (C4) can be obtained by shuffling $P_i$ 
and $\mathcal{H}_i$, with the last operation of $P_i$ occurring within $O(\max\{|P_i| \mid 1 \le i \le t\})$ 
operations of $\mathcal{H}'_i$. (Intuitively, $P_i$ consists of at most $M$ insert_i operations for the 
initial contents of $S_i$, and $M$ insert_i and $M$ delete_i operations for convergence 
to the contents identical to all sites.) In the following conditions, $z_i$ is the last 
point of $P_i$ in $\mathcal{H}'_i$. (C3) After point $z_i$, properties (C1) and (C2) hold. Prior to 
point $z_i$, any R_insert(x) operation may respond full, even if the contents of 
every site $S_i$ is below the threshold value $K$. (C4) If any insert_i(x) operation 
has an ok response at a point $z'_i$ before $z_i$, and this insert_i(x) operation comes from 
$\mathcal{H}_i$, then no delete_i(x) operation from $P_i$ occurs after $z'_i$. This implies that the 
item x inserted by a successful R_insert(x) operation is guaranteed to be in the 
replicated search tree until R_delete(x) is invoked. Note that (C4) is similar to 
the site availability requirement (B2). 

In the above, we assume that all operations are invoked after the initial 
configuration. In an initial configuration, however, an operation with corrupted control 
variables may be in progress. A careful examination of our protocol shows that 
it works correctly even when starting from such initial configurations. 
However, for lack of space and for simplicity, we are only concerned with an 
initial configuration and new operations in this extended abstract. 



4.2 Overview of Construction 

As observed in the previous subsection, all update operations should be executed 
in the same order at all sites to keep the replicas consistent. 
Our implementation relies on a self-stabilizing atomic broadcast (abbreviated 
as ss-ABcast) of the update operations, which is introduced in this paper. The 
basic idea for implementing the ss-ABcast is to provide (synchronous) rounds and to 
broadcast and deliver messages based on the rounds: each site can broadcast at 
most one message in each round, and all the messages broadcast in a round 
are delivered in the same order at all sites within the round. Section 4.3 shows 
our implementation of the ss-ABcast. 

To implement a stabilizing replicated search tree, we assume each replica is 
itself implemented by the stabilizing and available search tree proposed by mg. 
However, this is not sufficient to implement a stabilizing replicated search tree if 
different replicas have different contents. To solve this problem, we require some 
mechanism for convergence in addition to the mechanism for usual search tree 
operations. Thus, our implementation of a stabilizing replicated search tree alternates 
between two phases, the tree operation phase and the convergence phase. 
Each phase consists of two rounds of the ss-ABcast. 

In the tree operation phase, each site receives from the outside world at most 
one update operation and broadcasts the operation using the ss-ABcast. The 
operations are applied at all sites in the same order (i.e., in the order that they 
are delivered), and are committed or aborted depending on the responses to the 
operations in the second round of the phase. Section 4.4 shows more details of 
the tree operation phase. In the convergence phase, some tree operations are 
executed to check and correct the contents of each replica so that all replicas 
eventually have identical contents. Section 4.5 presents the convergence 
phase. 

Synchronization of phases is needed for our construction to work properly. 
Since we consider execution starting from any initial configuration, execution 
may start from the initial configuration where some sites are executing the tree 
operation phase but other sites are executing the convergence phase. This inconsistency 
can be easily detected and corrected by, for instance, returning to 
the beginning of the tree operation phase (i.e., resetting). Thus, in the rest of 
this paper, we assume that all sites are synchronized to execute the same phase, 
and we present each phase independently from the other. 



4.3 Self-Stabilizing Atomic Broadcast 

Atomic broadcast guarantees that all messages broadcast by sites are delivered 
at all sites in the same order. More precisely, atomic broadcast satisfies the following 
properties: validity, integrity and total order. In the following, ABcast(m) 
and Deliver(m) are operations that are invoked to broadcast a message m and 
to deliver a message m respectively. (D1) Validity: If a site invokes ABcast(m), 
then every site eventually invokes Deliver(m). (D2) Integrity: For any message 
m, every site invokes Deliver(m) at most once, and only if some site invoked 
ABcast(m). (D3) Total Order: If a site invokes Deliver(m) before Deliver(m'), 
then any other site also invokes Deliver(m) before Deliver(m'). 
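
Integrity and total order are safety properties, so they can be checked on finite delivery logs. The sketch below assumes a hypothetical representation (a set of broadcast messages and, per site, the list of delivered messages in delivery order); the function names are illustrative only. Validity is a liveness property and is not checked here.

```python
from itertools import combinations


def check_integrity(broadcast, delivered):
    """(D2): each site delivers a message at most once, and only if it was broadcast.

    broadcast: set of messages; delivered: dict site -> list of messages.
    """
    for log in delivered.values():
        if len(log) != len(set(log)):  # some message delivered twice
            return False
        if not set(log) <= broadcast:  # delivered but never broadcast
            return False
    return True


def check_total_order(delivered):
    """(D3): messages common to two logs appear in the same relative order."""
    for a, b in combinations(delivered.values(), 2):
        pos_b = {m: k for k, m in enumerate(b)}
        common = [m for m in a if m in pos_b]  # in the order of log a
        if sorted(common, key=pos_b.get) != common:
            return False
    return True
```

A pair of logs such as `["a", "b"]` and `["b", "a"]` violates total order, while `["a", "b", "c"]` and `["a", "c"]` do not.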

A self-stabilizing atomic broadcast (ss-ABcast) protocol is a protocol that 
satisfies validity, integrity and total order after finite time. Our 
implementation of ss-ABcast uses synchronous rounds to fix a set of messages to 
be delivered in a round. Each site can sequentially broadcast several messages 
by invoking ABcast operations, but we assume that the site invokes ABcast($m_2$) 
only after Deliver($m_1$) (if it invoked ABcast($m_1$) before). 

Figure 1 shows our ss-ABcast protocol for each site $i$ ($1 \le i \le t$). The key 
idea of the protocol is to provide rounds for all sites, which is achieved as follows. 
To separate rounds, we introduce a quiet period between two consecutive rounds 
in which no message is exchanged. If a site receives no message during $2\delta$ time, 
the site judges that the current round has ended and the next round begins. Timer TA_i 
is used to keep the time of the quiet period. The length $2\delta$ of the quiet period 
is chosen as follows: consider two messages $m$ and $m'$, and let $t$ and $t'$ (assume 
$t < t'$) be the times when $m$ and $m'$ are broadcast. If $t' - t < \delta$, the arrival order 
of the two messages at a site is unpredictable: it is possible that $m$ is received 
before $m'$ at one site while $m'$ is received before $m$ at another site. Our decision 
is, thus, that these messages are delivered in the same round. On the other hand, the 
difference between the arrival times of these messages may be almost $2\delta$ at a site 
if $t' - t$ is almost $\delta$ and the message delays of $m$ and $m'$ are almost 0 and $\delta$. 
Therefore, we choose $2\delta$ as the length of the quiet period. 
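
The choice of $2\delta$ can be verified with a short calculation. Here $a$ and $a'$ denote the arrival times of $m$ and $m'$ at one site; these two symbols are introduced only for this illustration and do not appear in the protocol.

```latex
% a, a' : arrival times of m and m' at a single site (notation assumed here)
% Premises: t <= a <= t + delta,  t' <= a' <= t' + delta,  0 < t' - t < delta
\begin{align*}
  a' - a &\le (t' + \delta) - t = (t' - t) + \delta < 2\delta, \\
  a - a' &\le (t + \delta) - t' = \delta - (t' - t) < \delta .
\end{align*}
```

Hence, whenever $t' - t < \delta$, the two arrivals at any one site are less than $2\delta$ apart, so a quiet period of length $2\delta$ cannot fall between them.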

Since message delay varies from 0 to $\delta$, the times when one broadcast message 
is received may differ by $\delta$ at distinct sites. Thus, if a site broadcasts a message 
of the next round immediately on detecting the end of the round, some site 
may not have a quiet period of $2\delta$ time. This requires an additional $\delta$ time 
until the site can broadcast the message. (Timer TB_i is used to keep the time of 
the additional period.) The following lemma implies that ss-ABcast guarantees 
the properties of atomic broadcast if we ignore “spurious” initial messages, 
which are delivered only in the first round. 
local variables of site i 
    TA_i: timer;   /* a countdown timer for a quiet period */ 
    TB_i: timer;   /* a countdown timer for starting the next round */ 
    rmsg_i: array of message; 
                   /* rmsg_i[j] stores a message site i receives from site j */ 

On ABcast(m)   /* on receipt of a broadcast request of m */ 
    if (TA_i or TB_i is running) 
        wait until alarm(TB_i); 
    send m to all sites;   /* including i */ 

On receipt of message m from site j 
    rmsg_i[j] := m; 
    if TB_i is running 
        set-timer(TB_i, 0);   /* cancel TB_i and immediately invoke alarm(TB_i) */ 
    set-timer(TA_i, 2δ);      /* set TA_i to 2δ */ 
    /* alarm(TA_i) will be invoked 2δ time later unless the timer is reset */ 

On alarm(TA_i) 
    deliver all messages in rmsg_i in some predefined order; 
    clear(rmsg_i); 
    set-timer(TB_i, δ); 



Fig. 1. The ss-ABcast protocol 
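
To make the timer logic of Fig. 1 concrete, the following is a minimal discrete-event simulation sketch, not the authors' implementation. It assumes a legitimate initial state (no timers running, no in-transit messages), message delays drawn uniformly from $[0, \delta]$, and sender index as the "predefined order"; the function name `simulate` and the request encoding are hypothetical.

```python
import heapq
import random


def simulate(t, delta, requests, seed=0):
    """Simulate Fig. 1. requests: list of (time, site, msg) ABcast invocations.

    Returns, per site, the list of delivered rounds (each a list of messages).
    """
    rng = random.Random(seed)
    queue, seq = [], 0

    def push(time, kind, site, data=None):
        nonlocal seq
        heapq.heappush(queue, (time, seq, kind, site, data))
        seq += 1

    ta = [None] * t  # expiry time of TA_i while running, else None
    tb = [None] * t  # expiry time of TB_i while running, else None
    rmsg = [dict() for _ in range(t)]
    waiting = [[] for _ in range(t)]  # broadcasts postponed until alarm(TB_i)
    delivered = [[] for _ in range(t)]

    def send(now, i, m):
        for j in range(t):  # including i itself
            push(now + rng.uniform(0, delta), "recv", j, (i, m))

    def alarm_tb(now, i):
        tb[i] = None
        if waiting[i]:
            send(now, i, waiting[i].pop(0))

    for time, i, m in requests:
        push(time, "abcast", i, m)
    while queue:
        now, _, kind, i, data = heapq.heappop(queue)
        if kind == "abcast":
            if ta[i] is not None or tb[i] is not None:
                waiting[i].append(data)  # wait until alarm(TB_i)
            else:
                send(now, i, data)
        elif kind == "recv":
            j, m = data
            rmsg[i][j] = m
            if tb[i] is not None:
                alarm_tb(now, i)  # set-timer(TB_i, 0)
            ta[i] = now + 2 * delta  # (re)start the quiet-period timer
            push(ta[i], "alarmA", i, ta[i])
        elif kind == "alarmA" and ta[i] == data:  # skip alarms of reset timers
            ta[i] = None
            delivered[i].append([rmsg[i][j] for j in sorted(rmsg[i])])
            rmsg[i].clear()
            tb[i] = now + delta
            push(tb[i], "alarmB", i, tb[i])
        elif kind == "alarmB" and tb[i] == data:
            alarm_tb(now, i)
    return delivered
```

Calling, for instance, `simulate(3, 1.0, [(0.0, 0, "m0"), (0.5, 1, "m1"), (0.9, 2, "m2")])` lets one inspect how messages broadcast within $\delta$ of each other are grouped into rounds at each site.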



Lemma 1. Figure 1 presents an ss-ABcast protocol, and the length of each round is 
at most $5\delta$. The ss-ABcast protocol satisfies the following properties: (E1) Validity: 
If a site invokes ABcast(m), then every site eventually invokes Deliver(m). 
(E2) 1-round-stabilizing Integrity: In the second round or later, Integrity 
is satisfied: for any message m, every site invokes Deliver(m) at most once, 
and only if some site invoked ABcast(m). For the first round, every site 
invokes Deliver(m) at most once if some site invoked ABcast(m); however, some 
messages may be delivered even if no site broadcasts such messages; these are henceforth 
called spurious messages. (E3) 1-round-stabilizing Total Order: In the second 
round or later, Total Order is satisfied: if a site invokes Deliver(m) before 
Deliver(m'), then any other site also invokes Deliver(m) before Deliver(m'). 
For the first round, Total Order is satisfied for the messages that are actually 
broadcast by sites. 

Proof Sketch. (E1) Validity: For contradiction, assume that ABcast(m) is 
invoked at a site $i$ but $m$ is never delivered at a site $j$. We consider two cases. 
(Case 1) $m$ is sent but not delivered: When $j$ receives $m$, $j$ sets timer TA_j. 
Since $j$ never delivers $m$, TA_j never expires after $j$ receives $m$. This implies that 
$j$ has no quiet period of length $2\delta$ or more after $j$ receives $m$. Let $\mathcal{M}$ be the set of 
broadcast messages that are not delivered at $j$, let $\mathcal{M}_1$ be the set of messages 
(including $m$) that are sent before or at the receipt of $m$, and let $\mathcal{M}_2 = \mathcal{M} - \mathcal{M}_1$. Let $t$ 
be the latest time when a message in $\mathcal{M}_1$ is sent. No process receives a message 
in $\mathcal{M}_1$ in the period $[t + \delta, \infty]$. The first sending of a message in $\mathcal{M}_2$ occurs when 
$3\delta$ time passes at a process after the last receipt of a message in $\mathcal{M}_1$. Thus, no 
message in $\mathcal{M}_2$ is sent in $[0, t + 3\delta]$. Therefore, no message is received at any 
process in $[t + \delta, t + 3\delta]$, and this contradicts that $j$ has no quiet period of length 
$2\delta$ or more. (Case 2) $m$ is not sent (sending of $m$ is postponed forever): This 
implies that TA_i never expires after the invocation of ABcast(m). This may 
happen if $i$ has no quiet period of length $2\delta$ or more after the invocation of the 
ABcast. We can show a contradiction by a discussion similar to that of Case 1. It 
also follows from the above argument that any message $m$ is delivered within $5\delta$ 
from its sending from a site (not from the invocation of ABcast(m)). Therefore, the 
length of each round is at most $5\delta$. (E2) 1-round-stabilizing Integrity: Spurious 
messages can be delivered if they are in transit or stored in rmsg in the initial 
configuration. The in-transit spurious messages are received before $\delta$ from the 
beginning of the execution. Thus all spurious messages are delivered in the first 
round. It is easy to see that Integrity is guaranteed in the second round or 
later. (E3) 1-round-stabilizing Total Order: It is sufficient to show that, for any 
messages $m_1$ and $m_2$ actually broadcast, any site delivers $m_1$ and $m_2$ in a round 
if a site delivers them in a round. Let $i$ be the sender (the originator) of $m_1$. 
From a discussion similar to that of (E1), we can show that $i$ sends $m_1$ before or 
at receiving $m_2$. Similarly, the originator of $m_2$ sends $m_2$ before or at receiving 
$m_1$. It follows that the times when $m_1$ and $m_2$ are sent differ by at 
most $\delta$. This implies that these messages are delivered in the same round at any 
process. □ 



4.4 Tree Operation Phase 

All that is required for an algorithm to be (self-)stabilizing is eventual convergence 
to legitimate behavior; however, even during the period of convergence to 
legitimacy, some of the semantics of update operations should be guaranteed 
(C4). For instance, after an R_insert(x) operation that responds ok, the item x 
should be contained in all replicas. 

Consider an R_insert(x) operation applied to a replicated 2-3 tree, which 
is implemented by insert(x) operations on all replicas. It could be that the 
insert(x) operation responds ok at one site but responds full at another site, 
since these sites may start with different initial configurations. If all sites begin 
with empty trees and all updates are replicated in the same order everywhere, 
then all 2-3 trees will have the same internal representation of their contents, 
and the situation of conflicting insert(x) responses will not occur. But if we 
consider arbitrary initial states, then all replicas could have the same contents, 
but with different internal representations (to see this, the reader can experiment 
with different orders of insertion of the same set of items into a 2-3 tree). Thus, it 
is possible that an insert(x) operation responds ok at one site while it responds 
full at another. 

To enforce uniformity of insert responses, one could maintain a count of 
items in each replica. Any insert operation would respond full if the current 
count exceeds threshold K, so that despite the ambiguity of 2-3 tree representation, 
all insert(x) operations would have the same local response. Introducing 
and relying upon such a counter introduces difficulties because any counter is 
subject to corruption by a transient fault. Since our goal is to preserve the se- 
mantics of any ok response during convergence, we reject the idea of a count in 
favor of another technique. 

In our implementation (Fig. 2), an R_insert(x) operation has two rounds, 
a propose round and a response round. When an R_insert(x) operation occurs 
at site $i$, then at each replica $S_j$ ($1 \le j \le t$), the operation insert_j(x) is 
applied, which responds either ok or full depending on the state of $S_j$. In the 
response round, the response of insert_j(x) at every site $j$ is broadcast to all 
sites. When a site $j$ receives the responses from all sites, it judges whether 
the R_insert(x) should be committed or aborted: if all sites respond ok, then the 
insert_j(x) is committed and, moreover, site $i$ returns ok to the outside world 
(as the response to R_insert(x)); in other cases, the insert_j(x) operation is 
aborted (i.e., delete_j(x) is executed to cancel insert_j(x)) and, moreover, site 
$i$ returns full to the outside world. Notice that all sites make the same decision, 
committed or aborted, concerning the R_insert operation. 
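
The commit/abort rule can be sketched in a few lines. This is a simplified, centralized illustration under stated assumptions: Python sets with a common capacity stand in for the replicas $S_1, \ldots, S_t$, the two broadcast rounds are collapsed into two loops, and the function name `r_insert` is hypothetical.

```python
def r_insert(replicas, capacity, x):
    """Apply R_insert(x) to all replicas and return the response (ok or full).

    replicas: list of sets standing in for the trees S_1..S_t.
    """
    responses = []
    for s in replicas:  # propose round: insert_j(x) at every replica
        if x in s or len(s) < capacity:
            s.add(x)
            responses.append("ok")
        else:
            responses.append("full")
    # Response round: every site sees the same response vector,
    # so every site reaches the same decision.
    if all(r == "ok" for r in responses):
        return "ok"  # committed
    for s in replicas:
        s.discard(x)  # aborted: delete_j(x) cancels insert_j(x)
    return "full"
```

If one replica is full while another is not, the insert is undone everywhere and the originating site reports full, so a reported ok always means that x is present in every replica.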

R_delete and R_insert operations require a linearized implementation of 
updates on the replicas to ensure that tree contents are everywhere the same. 
Our implementation relies on the ss-ABcast protocol of the previous subsection: 
the propose round and the response round each correspond to one round 
of the ss-ABcast protocol. In the propose round, at most one operation is invoked 
at each site, and the ss-ABcast protocol ensures that the operations are applied 
in the same order at all sites. However, some sites may also apply operations 
other than those actually invoked because of the spurious messages the 
ss-ABcast protocol delivers (but only in the first ss-ABcast round). We guarantee 
availability (C4) of R_insert operations by applying all delete_i operations 
before any insert_i operation at each site $i$: this prevents spurious delete_i 
operations from deleting the items inserted by R_insert operations. In the 
response round, each process broadcasts a response vector containing all responses 
of the update operations applied in the previous round. It is not necessary to 
deliver the response vectors in the same order at all sites, but, for simplicity, our 
implementation uses the ss-ABcast protocol to broadcast the response vectors. 
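
The "deletes before inserts" rule amounts to a stable partition of the operations delivered in a propose round. A minimal sketch, assuming a hypothetical encoding of operations as (kind, key, origin) triples:

```python
def apply_order(delivered):
    """Reorder one round's operations so that all R_delete precede all R_insert.

    delivered: list of (kind, x, origin) in delivery order.
    sorted() is stable, so operations of the same kind keep their delivery order.
    """
    return sorted(delivered, key=lambda op: op[0] != "R_delete")
```

With this order, a spurious R_delete(5) delivered in the same round as a genuine R_insert(5) is applied first, so the inserted item survives the round.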

Details of the protocol are presented in Figure 2, and we discuss the main features 
of the logic here. The protocol deals with only update operations R_insert 
and R_delete because R_find is an entirely local operation and its implemen- 
tation is straightforward. In the protocol, ABcast* is used instead of ABcast to 
broadcast operations invoked from the outside world. The difference between the 
two operations is that ABcast* postpones the broadcast to the propose round 
of the next tree operation phase if either timer TA or TB (in the ss-ABcast 
protocol) is running; recall that ABcast simply postpones the broadcast to the 
next round. This modification is necessary because R_insert and R_delete may 
occur at any time but should be dealt with in a tree operation phase. 
local variables of site i 
    op_i: array of operation;     /* op_i[j] is the operation originating at site j */ 
    rsp_i: array of response;     /* rsp_i[j] is the response to op_i[j] at site i */ 
    rcv_rsp_i: array of response; 
                                  /* rcv_rsp_i[j, k] is the rsp_j[k] received from site j */ 

On receipt of operation Op_i from the outside world 
    /* propose round begins */ 
    ABcast*(Op_i); 
    repeat   /* execute local operations in the order delivered */ 
        wait until Deliver(op); 
        (R_delete operations are delivered before any R_insert operation) 
        if op = R_delete(x) originating at j 
            op_i[j] := op; delete_i(x); rsp_i[j] := ok; 
        if op = R_insert(x) originating at j 
            op_i[j] := op; insert_i(x); rsp_i[j] := response to insert_i(x); 
    until all messages are delivered; 

    /* response round begins */ 
    ABcast(rsp_i[1..t]); 
    repeat   /* store response vectors */ 
        wait until Deliver(rsp[1..t]); 
        if rsp[1..t] is received from j 
            rcv_rsp_i[j, 1..t] := rsp[1..t]; 
    until all messages are delivered; 

    for each k ∈ {1, ..., t}   /* determine whether R_insert is committed or aborted */ 
        if op_i[k] = R_insert(x) 
            if rcv_rsp_i[j, k] = full for some j 
                delete_i(x);   /* cancel insert_i(x) */ 
    if Op_i = R_insert(x) and rcv_rsp_i[j, i] = full for some j 
        respond(full) 
    else respond(ok) 



Fig. 2. A protocol for tree operation phase 



4.5 Convergence Phase 

An illegitimate state for the replicated tree requires repair operations: an 
arbitrary sequence of R_insert and R_delete operations need not force all replicas 
to have identical contents if they initially differ. Therefore, in the convergence 
phase, additional operations are added. The basic idea is that all replicas 
participate in a coordinated enumeration of their contents. Suppose this coordinated 
enumeration begins with the smallest item in each replica: each site broadcasts 
the smallest item of its replica. If all sites have the same smallest item, then the 
coordinated enumeration will continue with the second smallest item. This will 
continue until all sites enumerate the largest tree item, at which point the enumeration 
will start again with the smallest tree item. Since at most $t$ items are newly 
inserted in each tree operation phase, $t + 1$ or more items should be enumerated 
in each convergence phase to complete the enumeration in time depending on 
the maximum number of items accessible at a site in the initial configuration. 
Thus, in our protocol, each site first broadcasts the $s$ ($\ge t + 1$) smallest items in 
a convergence phase, and broadcasts the next $s$ ($\ge t + 1$) smallest items in the 
next convergence phase, and so on. 

As stated above, the enumeration goes back to the smallest item
when it reaches the maximum one. In the following, however, for simplicity of 
description, we ignore the situation where such wrapped enumeration occurs. 

The only interesting case in the enumeration occurs when sites report a dif- 
ference for some item. There are two possibilities for such an apparent difference. 
Either there is a real difference between replicas with respect to the item or sites 
are somehow uncoordinated in their enumerations. This possibility of an unco- 
ordinated enumeration can arise due to the arbitrary initial configuration. To 
detect the uncoordinated enumeration, each site broadcasts an item z called a 
starting item and the s smallest items larger than or equal to z. If sites broad- 
cast different starting items, then uncoordinated enumeration is detected and the 
enumeration starts from the smallest item in each replica by setting z to $-\infty$.
If all sites broadcast the same starting item, then coordination in enumeration
is guaranteed and differences among the broadcast items imply differences among
the contents of replicas. In this case, appropriate insert or delete operations are
applied to each replica in order to correct the replicas with respect
to those items.

It is useful here to depart from the description of our construction and con- 
sider various possibilities for correcting the replicas. Two extreme designs are 
intersection and union. The intersection approach is to delete item x from all
sites if x is not present in all sites. The union approach is to insert item x in all
sites if it is present in any of them. But these two approaches are vulnerable to
a transient fault: if an item is lost at only a single replica because of a transient 
fault, the item will be removed from all replicas in the intersection approach. 
Similarly, the union approach is vulnerable to erroneous insertion at only a single 
replica. An intermediate and more desirable approach is to use majority vote: 
delete p from all replicas unless a majority contain p, in which case p should be 
inserted in all sites. For either the union or majority approach, the remedy of
insertion can still fail: the result of an insert(p) operation could be full. (The
sequential 2-3 tree implementation can respond full even when the number
of items is less than the threshold capacity during the period of convergence
to a legitimate state.) Because of the possibility of encountering full responses,
we deploy the same two-round idea from the R_insert implementation, so sites
apply delete(p) if p cannot be inserted in all replicas.
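
The three correction policies discussed above can be compared on a small example. The following is only an illustrative sketch: the function name `correct_item` and the representation of replicas as Python sets are assumptions made for this example, and the handling of full responses (the two-round commit/abort step) is deliberately left out.

```python
from typing import Hashable, List, Set


def correct_item(item: Hashable, replicas: List[Set[Hashable]], policy: str) -> bool:
    """Return True if `item` should end up in every replica,
    False if it should be deleted from every replica.

    Hypothetical helper: `policy` is one of "intersection", "union", "majority".
    """
    holders = sum(1 for replica in replicas if item in replica)
    if policy == "intersection":
        return holders == len(replicas)      # keep only if all replicas hold it
    if policy == "union":
        return holders >= 1                  # keep if any replica holds it
    if policy == "majority":
        return 2 * holders > len(replicas)   # keep if a strict majority holds it
    raise ValueError(f"unknown policy: {policy}")


# Item 2 was lost at a single replica by a transient fault.
lost = [{1, 2, 3}, {1, 2, 3}, {1, 3}]
# Item 9 was erroneously inserted at a single replica.
spurious = [{1}, {1}, {1, 9}]

print(correct_item(2, lost, "intersection"))   # the lost item is removed everywhere
print(correct_item(9, spurious, "union"))      # the spurious item is spread everywhere
print(correct_item(2, lost, "majority"), correct_item(9, spurious, "majority"))
```

In this example only the majority policy both restores the lost item and discards the spurious one.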

Details of the protocol are shown in Fig. 3. The protocol adopts the majority
approach for correcting the replicas, but it can easily be modified to
correct the replicas according to some other approach. In the propose round,
each site finds and broadcasts (using the ss-ABcast protocol) the s ($\ge t + 1$)
smallest items larger than or equal to the starting item z. Note that finding
the s smallest items can be easily implemented by traversing the replica in the 
local variables of site i
    key_i: array [1..t, 0..s] of key;
        /* key_i[j,0] is the starting item site j broadcasts */
        /* key_i[j,1..s] is the s keys site j broadcasts as
           the s smallest keys no smaller than key_i[j,0] */
    start_i: key;  /* the starting item */
    ins_key_i: array [1..2s] of key;  /* the inserted keys */
    rcv_rsp_i: array [1..t, 1..2s] of response;
        /* rcv_rsp_i[j,m] is response to insert(ins_key_j[m]) at site j */

begin
    /* propose round begins */
    key_i[i,0] := start_i;
    for each m (1 ≤ m ≤ s)
        key_i[i,m] := the m-th smallest key no smaller than key_i[i,0] in the replica of i;
    ABcast(key_i[i,0..s]);

    repeat /* store keys broadcast by sites */
        wait until Deliver(key[0..s]);
        if key[0..s] is received from j
            key_i[j,0..s] := key[0..s];
    until all messages are delivered;

    if key_i[j,0] ≠ key_i[k,0] for some j and k
        /* uncoordinated enumeration is detected */
        start_i := −∞;
        /* reset the starting item so that enumeration should
           start from the smallest item */
        exit /* skip the correction phase */
    else /* coordinated enumeration is guaranteed */
        apply rules (R1) and (R2) for every item in key_i[1..t, 1..s];
        store the keys satisfying the condition of (R1) in ins_key_i[1..2s];
        store the corresponding responses in rcv_rsp_i[i,1..2s];
        start_i := the key chosen by rule (R3);

    /* response round begins */
    ABcast(rcv_rsp_i[i, 1..2s]);

    repeat /* store the responses broadcast by sites */
        wait until Deliver(rsp[1..2s]);
        if rsp[1..2s] is received from j
            rcv_rsp_i[j, 1..2s] := rsp[1..2s];
    until all messages are delivered;

    for each k (1 ≤ k ≤ 2s)
        if rcv_rsp_i[j, k] = full for some j
            delete_i(ins_key_i[k]); /* cancel insert_i operation */



Fig. 3. A protocol for the convergence phase 



depth-first fashion from z. Unless the uncoordinated enumeration is detected, 
each site i tries to apply the following two rules to each of the delivered items. In 
the rules, x denotes the item being checked and $I_j$ ($1 \le j \le t$) denotes the set
of the s items that site i received from site j.

(R1) If a majority of the $I_j$’s each contain x, then site i executes insert_i(x)
if x is not in i’s replica.

(R2) If a majority of the $I_j$’s each do not contain x but contain some item larger
than x, then site i executes delete_i(x) if x is in i’s replica. (Notice that the
condition of this rule guarantees that only a minority contains x.)

After applying the above rules to all delivered items, the new starting item z' is
determined by the following rule. This rule guarantees that, for any item x in a
replica such that $z \le x < z'$ (where z is the old starting item), x is contained in
all replicas.

(R3) If every broadcast item satisfies the condition of either (R1) or (R2), then
the imaginary key $x^+$ is chosen as z', where x is the largest broadcast item,
$x^+ > x$, and $x^+ < x'$ for any item $x' > x$. Otherwise, the smallest broadcast item
that satisfies the condition of neither (R1) nor (R2) is chosen as the new
starting item.

In the following response round, the responses to the insert operations executed
locally at the site are broadcast (using the ss-ABcast protocol), and the insert
operations are committed or aborted in the same way as the tree operation phase.
Notice that the number of the responses each site broadcasts at a response round
is less than 2s, because only the responses to insert operations are broadcast.
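
Rules (R1) to (R3) can be traced on small inputs with the sketch below. It is not the protocol of Fig. 3: it assumes integer keys, takes the sets $I_1, \ldots, I_t$ as already delivered, ignores full responses and the response round, and encodes the imaginary key $x^+$ as the tagged pair `("just_after", x)`. The names `classify` and `convergence_step` are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def classify(x: int, received: List[Set[int]]) -> str:
    """Return "R1", "R2", or "none" for item x, given the sets I_1..I_t."""
    t = len(received)
    contain = sum(1 for items in received if x in items)
    # Sets that lack x but hold some larger item (the condition of R2).
    lack_with_larger = sum(
        1 for items in received if x not in items and any(y > x for y in items)
    )
    if 2 * contain > t:
        return "R1"
    if 2 * lack_with_larger > t:
        return "R2"
    return "none"


def convergence_step(
    replica: Set[int], received: List[Set[int]]
) -> Tuple[Set[int], Optional[Tuple[str, int]]]:
    """Apply (R1)/(R2) to a copy of `replica` and choose the next start by (R3)."""
    new_replica = set(replica)
    broadcast = sorted(set().union(*received))
    if not broadcast:                    # nothing was broadcast: nothing to decide
        return new_replica, None
    undecided = []
    for x in broadcast:
        rule = classify(x, received)
        if rule == "R1":
            new_replica.add(x)           # insert_i(x) if x is missing locally
        elif rule == "R2":
            new_replica.discard(x)       # delete_i(x) if x is present locally
        else:
            undecided.append(x)
    if undecided:
        return new_replica, ("at", undecided[0])
    return new_replica, ("just_after", broadcast[-1])  # stands for the key x^+


# Three sites, s = 2. Item 3 is held by one site and nobody reports a larger item.
print(convergence_step({1, 3}, [{1, 2}, {1, 3}, {1, 2}]))
# Item 1 is held by a minority while a majority reports larger items.
print(convergence_step({1, 4}, [{1, 4}, {2, 4}, {2, 4}]))
```

In the first call item 3 satisfies neither rule, so enumeration restarts at 3; in the second call every item is decided, so the new starting item is the imaginary key just after 4.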

Now we give an intuitive estimate of the number of ss-ABcast rounds
needed for convergence. Regard each item as dirty in the initial configuration;
it becomes clean when it is removed or is guaranteed to be contained in all
replicas during the convergence phase. Assume that coordination of enumeration
is established. First, suppose that, in each convergence phase, each site i proposes
one dirty item $x_i$ from its replica and applies (R1) and (R2) to the proposed
items. If a majority of the proposed items have the same key, rule (R1) makes the
majority (at least t/2 items) clean. Otherwise let z be the new starting point
chosen by rule (R3). Notice that there exist at least n/2 items x such that
x < z among the proposed items. We can see that these items become clean in
this or the next convergence phase. In our protocol, each site proposes s ($\ge t + 1$)
items in each convergence phase. By an amortized estimate, we can regard
each site as proposing at least s − t ($\ge 1$) dirty items in each convergence phase.
From the above discussion, two consecutive convergence phases make at least
t/2 items clean and, thus, we can show the following theorem.

Theorem 1. The protocol constructed from the protocols of Fig. 2 and Fig. 3
is an implementation of a stabilizing search tree. It reaches a configuration
such that all replicas have identical contents within O(d) rounds of the ss-ABcast
protocol, where d is the maximum number of items accessible at a site in the
initial configuration.



5 Conclusion 

This paper proposed a construction for available and stabilizing replicated search 
trees in a message-passing, synchronous distributed system. One of the main con- 
tributions of this work is to introduce stabilization and availability into replicated 
objects. We have addressed two aspects of availability: R_find operations are
executed locally, and all operations are allowed, with some reliability guarantees,
at all points in an execution. Another contribution of this paper is to present 
a stabilizing atomic broadcast protocol, a general and powerful tool for design- 
ing several services in a stabilizing fashion. Our construction is based on the 
stabilizing atomic broadcast and checking-and-correction of replicas. This gives 
a framework for constructing available and stabilizing replicated objects and is 
expected to be applied for implementing other replicated objects. 
Abstract. An adding network is a distributed data structure that supports
a concurrent, lock-free, low-contention implementation of
a fetch&add counter; a counting network is an instance of an adding
network that supports only fetch&increment.

We present a lower bound showing that adding networks have inherently
high latency. Any adding network powerful enough to support addition
by at least two values a and b, where $|a| > |b| > 0$, has sequential executions
in which each token traverses $\Omega(n/c)$ switching elements, where n is
the number of concurrent processes, and c is a quantity we call one-shot
contention; for a large class of switching networks and for conventional
counting networks the one-shot contention is constant. In contrast,
counting networks have $O(\log n)$ latency [4,7].

This bound is tight. We present the first concurrent, lock-free, low- 
contention networked data structure that supports arbitrary fetch&add 
operations. 



1 Introduction 

Motivation-Overview 

A fetch&increment variable provides an operation that atomically adds one to 
its value and returns its prior value. Applications of fetch&increment counters 
include shared pools and stacks, load balancing, and software barriers. 

A counting network is a class of distributed data structures used to construct
concurrent, low-contention implementations of fetch&increment counters.
A limitation of the original counting network constructions is that the resulting
shared counters can be incremented, but not decremented. More recently,
Shavit and Touitou showed how to extend certain counting network constructions
to support decrement operations, and Aiello and others extended
this technique to arbitrary counting network constructions.
Island, August 2001. Part of the work of the first author was performed while
affiliated with the Max-Planck Institut für Informatik, Saarbrücken, Germany, and while
visiting the Department of Computer Science, Brown University, Providence, USA.

© Springer-Verlag Berlin Heidelberg 2001



Adding Networks 






In this paper we consider the following natural generalization of these recent
results: can we construct network data structures that support lock-free, highly
concurrent, low-contention fetch&add operations? (A fetch&add atomically adds
an arbitrary value to a shared variable, and returns the variable’s prior value.)

We address these problems in the context of concurrent switching networks, a 
generalization of the balancing networks used to construct counting networks. As 
discussed in more detail below, a switching network is a directed graph, where 
edges are called wires and nodes are called switches. Each of the n processes 
shepherds a token through the network. Switches and tokens are allowed to have 
internal states. A token arrives at a switch via an input wire. In one atomic step, 
the switch absorbs the token, changes its state and possibly the token’s state, 
and emits the token on an output wire. 

Let S be any non-empty set of integer values. An S-adding network is a
switching network that implements the set of operations fetch&add(·, s), for s
an element of S. As a special case, an (a, b)-adding network supports two operations:
fetch&add(·, a) and fetch&add(·, b). A process executes a fetch&add(·, a)
operation by shepherding a token of weight a through the network.

Our results encompass both bad news and good news. First the bad news. We
define the network’s one-shot contention to be the largest number of tokens that
can meet at a single switch in any execution in which exactly n tokens enter the
network on distinct wires. For counting networks, and for the (first) switching
network presented in Section 4, this quantity is constant. We show that for
any (a, b)-adding network, where $|a| > |b| > 0$, there exist n-process sequential
executions where each process traverses $\Omega(n/c)$ switches, where c is the network’s
one-shot contention. This result implies that any lock-free low-contention adding
network must have high worst-case latency, even in the absence of concurrency.
As an aside, we note that there are two interesting cases not subject to our
lower bound: a low-latency (a, −a)-adding network is given by the antitoken
construction, and an (a, 0)-adding network is just a regular counting network
augmented by a pure read operation.

Now for the good news. We introduce a novel construction for a lock-free, 
low-contention fetch&add switching network, called Ladder, in which processes 
take O(n) steps on average. Tokens carry mutable values, and switching elements
are balancers augmented by atomic read-write variables. The construction is
lock-free, but not wait-free (meaning that individual tokens can be overtaken 
arbitrarily often, but that some tokens will always emerge from the network in a 
finite number of steps). Ladder is the first concurrent, lock-free, low-contention 
networked data structure that supports arbitrary fetch&add operations. 

An ideal fetch&add switching network (like an ideal counting network)
is (1) lock-free, with (2) low contention, and (3) low latency. Although this
paper shows that no switching network can have all three properties, any two are
possible:¹ a single switch is lock-free with low latency, but has high contention;
a combining network has low contention and $O(\log n)$ latency but requires
tokens to wait for one another; and the construction presented here is lock-free
with low contention, but has O(n) latency.

¹ NASA’s motto “faster, cheaper, better” has been satirized as “faster, cheaper, better:
pick any two”.



Related Work 



Counting networks were first introduced by Aspnes et al. [3]. A flurry of research
on counting networks followed (see e.g., [I r2l4l(il7lt)l I 1^ 1. Counting networks are 
limited to support only fetch&increment and fetch&decrement operations. Our
work is the first to study whether lock-free network data structures can support 
even more complex operations. We generalize traditional counting networks by 
introducing switching networks, which employ more powerful switches and to- 
kens. Switches can be shared objects characterized by arbitrary internal states. 
Moreover, each token is allowed to have a state by maintaining its own variables; 
tokens can exchange information with the switches they traverse. 

Surprisingly, it turns out that supporting even the slightly more complex
operation of fetch&add, where adding is by only two different integers a, b such
that $|a| > |b| > 0$, is as difficult as ensuring linearizability. Earlier work on
linearizability proves that there exists no ideal linearizable counting network.
In a corresponding way, our lower bound implies that even the most powerful
switching networks cannot guarantee efficient support of this relatively simple
fetch&add operation.

The Ladder switching network has the same topology as the linearizable
Skew counting network presented by Herlihy and others, but the behavior of the Ladder
network is significantly different. In this network, tokens accumulate state as 
they traverse the network, and they use that state to determine how they interact 
with switches. The resulting network is substantially more powerful, and requires 
a substantially different analysis. 



Organization 

This paper is organized as follows. Section 2 introduces switching networks. Our
lower bound is presented in Section 3, while the Ladder network is introduced
in Section 4.



2 Switching Networks 

A switching network, like a counting network [3], is a directed graph whose nodes
are simple computing elements called switches, and whose edges are called wires.
A wire directed from switch b to switch b' is an output wire for b and an input
wire for b'. A $(w_{in}, w_{out})$-switching network has $w_{in}$ input wires and $w_{out}$ output
wires. A switch is a shared data object characterized by an internal state, its set
of $f_{in}$ input wires, labeled $0, \ldots, f_{in} - 1$, and its set of $f_{out}$ output wires, labeled
$0, \ldots, f_{out} - 1$. The values $f_{in}$ and $f_{out}$ are called the switch’s fan-in and fan-out,
respectively.

There are n processes that move (shepherd) tokens through the network. 
Each process enters its token on one of the network’s $w_{in}$ input wires. After the
token has traversed a sequence of switches, it leaves the network on one of its
$w_{out}$ output wires. A process shepherds only one token at a time, but it can
start shepherding a new token as soon as its previous token has emerged from 
the network. Processes work asynchronously, but they do not fail. In contrast to 
counting networks, associated to each token is a set of variables (that is, each 
token has a mutable state), which can change as it traverses the network. 

A switch acts as a router for tokens. When a token arrives on a switch’s in- 
put wire, the following events can occur atomically: (1) the switch removes the 
token from the input wire, (2) the switch changes state, (3) the token changes 
state, and (4) the switch places the token on an output wire. The wires are 
one-way communication channels and allow reordering. Communication is asynchronous
but reliable (meaning a token does not wait on a wire forever). For each
$(f_{in}, f_{out})$-switch, we denote by $x_i$, $0 \le i \le f_{in} - 1$, the number of tokens that
have entered on input wire i, and similarly we denote by $y_j$, $0 \le j \le f_{out} - 1$,
the number of tokens that have exited on output wire j.

As an example, a $(k, \ell)$-balancer is a switch with fan-in k and fan-out $\ell$. The
i-th input token is routed to output wire $i \bmod \ell$. Counting networks are constructed
from balancers and from simple one-input one-output counting switches.
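
The routing rule of a balancer is easy to state in code. The class below is a minimal single-threaded sketch, not a concurrent implementation: the name `Balancer` is chosen for this example, tokens are assumed to be counted from 0, and the atomicity of a real switch step is not modeled.

```python
class Balancer:
    """A (k, l)-balancer: the i-th arriving token leaves on output wire i mod l."""

    def __init__(self, fan_in: int, fan_out: int) -> None:
        self.fan_in = fan_in
        self.fan_out = fan_out
        self.arrived = 0  # number of tokens routed so far (the switch's state)

    def route(self, input_wire: int) -> int:
        """Route one token arriving on `input_wire`; return its output wire."""
        if not 0 <= input_wire < self.fan_in:
            raise ValueError("no such input wire")
        output_wire = self.arrived % self.fan_out
        self.arrived += 1
        return output_wire


balancer = Balancer(fan_in=2, fan_out=3)
outputs = [balancer.route(i % 2) for i in range(5)]
print(outputs)
```

Note that the output wire depends only on the arrival order, not on the input wire a token used.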

It is convenient to characterize a switch’s internal state as a collection of 
variables, possibly with initial values. The state of a switch is given by its internal 
state and the collection of tokens on its input and output wires. Each token’s 
state is also characterized by a set of variables. Notice that a token’s state is 
part of the state of the process owning it. A process may change the state of its 
token while moving it through a switch. A switching network’s state is just the 
collection of the states of its switches. 

A switch is quiescent if the number of tokens that arrived on its input wires
equals the number that have exited on its output wires:
$\sum_{i=0}^{f_{in}-1} x_i = \sum_{j=0}^{f_{out}-1} y_j$.
The safety property of a switch states that in any state,
$\sum_{i=0}^{f_{in}-1} x_i \ge \sum_{j=0}^{f_{out}-1} y_j$;
that is, a switch never creates tokens spontaneously. The liveness property states
that given any finite number of input tokens to the switch, it is guaranteed that
it will eventually reach a quiescent state. A switching network is quiescent if all
its switches are quiescent.
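
Both conditions are statements about a switch's per-wire counters: the numbers of tokens that have entered on each input wire and exited on each output wire. As a small sketch, with the function names `is_safe` and `is_quiescent` assumed for this example and the counters passed as plain lists:

```python
from typing import Sequence


def is_safe(entered: Sequence[int], exited: Sequence[int]) -> bool:
    """Safety: a switch never emits more tokens than it has absorbed."""
    return sum(exited) <= sum(entered)


def is_quiescent(entered: Sequence[int], exited: Sequence[int]) -> bool:
    """Quiescence: every token that entered the switch has also exited."""
    return sum(entered) == sum(exited)


# A (2, 3)-switch that has absorbed 4 tokens and emitted 3 of them so far.
print(is_safe([3, 1], [1, 1, 1]), is_quiescent([3, 1], [1, 1, 1]))
```

The example state is safe but not quiescent, since one token is still inside the switch.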

We denote by $\pi = (t, b)$ the state transition in which the token t is moved
from an input wire to an output wire of a switch b. If a token t is on one of
the input wires of a switch b at some network state s, we say that t is in front
of b at state s or that the transition (t, b) is enabled at state s. An execution
fragment $\alpha$ of the network is either a finite sequence $s_0, \pi_1, s_1, \ldots, \pi_n, s_n$ or an
infinite sequence $s_0, \pi_1, s_1, \ldots$ of alternating network states and transitions such
that for each $(s_i, s_{i+1})$, the transition $\pi_{i+1}$ is enabled at state $s_i$ and carries
the network to state $s_{i+1}$. If $\pi_{i+1} = (t, b)$ we say that token t takes a step at
state $s_i$ (or that t traverses b at state $s_i$). An execution fragment beginning with
an initial state is called an execution. If $\alpha$ is a finite execution fragment of the
network and $\alpha'$ is any execution fragment that begins with the last state of $\alpha$,
then we write $\alpha \cdot \alpha'$ to represent the sequence obtained by concatenating $\alpha$ and
$\alpha'$ and eliminating the duplicate occurrence of the last state of $\alpha$.
For any token t, a t-solo execution fragment is an execution fragment in
which only token t takes steps. A t-complete execution fragment is
an execution fragment at the final state of which token t has exited the network.
A finite execution is complete if it results in a quiescent state. An execution
is sequential if for any two transitions $\pi = (t, b)$ and $\pi' = (t, b')$, all transitions
between them also involve token t; that is, tokens traverse the network one
completely after the other.
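
The definition of a sequential execution amounts to requiring that each token's transitions form one contiguous block. A short sketch of that check follows; the function name `is_sequential` and the encoding of an execution as a list of (token, switch) pairs, with network states omitted, are assumptions of this example.

```python
from typing import Hashable, List, Tuple


def is_sequential(transitions: List[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if no token takes a step after another token has started
    stepping in between, i.e. each token's transitions are contiguous."""
    finished = set()   # tokens whose block of transitions has already ended
    current = None     # token taking steps right now
    for token, _switch in transitions:
        if token != current:
            if token in finished:
                return False          # this token resumes after being interrupted
            if current is not None:
                finished.add(current)
            current = token
    return True


one_after_the_other = [("t1", "b0"), ("t1", "b1"), ("t2", "b0"), ("t2", "b1")]
interleaved = [("t1", "b0"), ("t2", "b0"), ("t1", "b1")]
print(is_sequential(one_after_the_other), is_sequential(interleaved))
```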

A switch b has the l-balancing property if, whenever l tokens reach each
input wire of b, exactly l tokens exit on each of its output wires. We say
that a switching network is an l-balancing network if all its switches preserve
the l-balancing property. It can be proved that in any execution $\alpha$ of an l-balancing
network $\mathcal{N}$, in which no more than l tokens enter on any input wire
of the network, there are never more than l tokens on any wire of the network.

The latency of a switching network is the maximum number of switches 
traversed by any single token in any execution. The contention of an execution 
is the maximum number of tokens that are on the input wires of any particular 
switch at any point during the execution. The contention of a switching network 
is the maximum contention of any of its executions. In a one-shot execution, only
n tokens (one per process) traverse the network. The one-shot contention of a
switching network, denoted c, is the maximum contention over all its one-shot
executions in which the n tokens are uniformly distributed on the input wires.
For counting networks with $\Omega(n)$ input wires, and for the switching network
presented in Section 4, c is constant.

For any integer set S, an S-adding network A is a switching network that
supports the operation fetch&add(·, v) only for values $v \in S$. More formally, let
$l > 0$ be any integer and consider any complete execution $\alpha$ which involves l
tokens $t_1, \ldots, t_l$. Assume that for each i, $1 \le i \le l$, $\beta_i$ is the weight of $t_i$ and $v_i$
is the value taken by $t_i$ in $\alpha$. The adding property for $\alpha$ states that there exists a
permutation $i_1, \ldots, i_l$ of $1, \ldots, l$, called the adding order, such that (1) $v_{i_1} = 0$,
and (2) for each j, $1 \le j < l$, $v_{i_{j+1}} = v_{i_j} + \beta_{i_j}$; that is, the first token in the
order returns the value zero, and each subsequent token returns the sum of the
weights of the tokens that precede it. We say that a switching network is an
adding network if it is a $\mathbb{Z}$-adding network, where $\mathbb{Z}$ is the set of integers.
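
To make the adding property concrete, the sketch below searches for an adding order among the tokens of a complete execution. It is a brute-force check over all permutations, suitable only for a handful of tokens; the function name `adding_order` and the encoding of each token as a (weight, value) pair are assumptions of this example.

```python
from itertools import permutations
from typing import List, Optional, Tuple


def adding_order(tokens: List[Tuple[int, int]]) -> Optional[Tuple[int, ...]]:
    """tokens[i] = (weight of token i, value returned to token i).

    Return a permutation of token indices in which the first token returned 0
    and every later token returned the sum of the weights before it, or None
    if no such order exists.
    """
    for order in permutations(range(len(tokens))):
        running_sum = 0
        for index in order:
            weight, value = tokens[index]
            if value != running_sum:
                break
            running_sum += weight
        else:
            return order   # every token in this order matched its prefix sum
    return None


# Token 1 (weight 2) went first, then token 0 (weight 3), then token 2.
print(adding_order([(3, 2), (2, 0), (3, 5)]))
# Two tokens of non-zero weight both returned 0: no adding order exists.
print(adding_order([(3, 0), (2, 0)]))
```

The second example is the situation exploited in the lower-bound proof: two tokens of non-zero weight taking the same value violates the adding property.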



3 Lower Bound 

Consider an (a, b)-adding network such that $|a| > |b| > 0$. We may assume
without loss of generality that a and b have no common factors, since any
(a, b)-adding network can be trivially transformed to an $(a \cdot k, b \cdot k)$-adding network,
and vice versa, for any non-zero integer k. Similarly, we can assume that a is
positive. We show that in any sequential execution (involving any number of
tokens), tokens of weight b must traverse at least $\lceil (n-1)/(c-1) \rceil$ switches, where
c is the one-shot contention of the network. If $|b| > 1$, then in any sequential
execution, tokens of weight a must also traverse the same number of switches.

We remark that our lower bound holds for all (a, b)-adding networks, independently
of, e.g., the topology of the network (the width or the depth of the
network, etc.) and the state of both the switches and the tokens. Moreover, our
lower bound holds for all sequential executions involving any number of tokens
(and not only for one-shot executions).

Theorem 1. Consider an (a, b)-adding network A where $|a| > |b| > 0$. Then,
in any sequential execution of A,

1. each token of weight b traverses at least $\lceil (n-1)/(c-1) \rceil$ switches, and

2. if $|b| > 1$ then each token of weight a traverses at least $\lceil (n-1)/(c-1) \rceil$
switches.
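
For a feel of the magnitudes involved, the bound of the theorem can be evaluated directly. The helper below is only an arithmetic illustration; the name `min_switches` is hypothetical, and it assumes $c \ge 2$ so that the denominator $c - 1$ is positive.

```python
def min_switches(n: int, c: int) -> int:
    """Evaluate ceil((n - 1) / (c - 1)) using integer arithmetic only.

    n is the number of processes and c the one-shot contention (assumed >= 2).
    """
    if n < 1 or c < 2:
        raise ValueError("need n >= 1 and c >= 2")
    return -(-(n - 1) // (c - 1))   # ceiling division without floats


# With constant one-shot contention the bound grows linearly in n.
for n in (9, 65, 1025):
    print(n, min_switches(n, 2), min_switches(n, 4))
```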

Proof. We prove something slightly stronger: the stated lower bound holds
for any token that goes through the network alone, independently of whether
tokens before it have gone through the network sequentially.

Start with the network in a quiescent state $s_0$, denote by $\alpha_0$ the execution
with final state $s_0$, and let $t'_1, \ldots, t'_l$ be the tokens involved in $\alpha_0$, where $l > 0$ is
some integer. Denote by $\beta_j$, $1 \le j \le l$, the weight of token $t'_j$ and let $v = \sum_{j=1}^{l} \beta_j$.
The adding property implies that v is the next value to be taken by any token
(serially) traversing the network. Let token t of weight x, $x \in \{a, b\}$, traverse the
network next. Let $y \in \{a, b\}$, $y \ne x$, be the other value of {a, b}; that is, if x = a
then y = b, and vice versa. Because $|a| > |b| > 0$, $b \not\equiv 0 \pmod{a}$. Thus, if x = b,
$x \not\equiv 0 \pmod{y}$. On the other hand, if x = a it again holds that $x \not\equiv 0 \pmod{y}$
because by assumption $|y| > 1$ and x, y have no common factors.

Denote by B the set of switches that t traverses in a t-solo, t-complete execution
fragment from $s_0$.

Consider n − 1 tokens $t_1, \ldots, t_{n-1}$, all of weight y. We construct an execution
in which each token $t_i$, $1 \le i \le n-1$, must traverse some switch of B. Assume
that all n tokens $t, t_1, \ldots, t_{n-1}$ are uniformly distributed on the input wires.

Lemma 1. For each i, $1 \le i \le n-1$, there exists a $t_i$-solo execution fragment $\alpha_i$
with final state $s_i$ starting from state $s_{i-1}$ such that $t_i$ is in front of a switch
$b_i \in B$ at state $s_i$.

Proof. By induction on i, $1 \le i \le n-1$.

Basis Case

We claim that in the $t_1$-solo, $t_1$-complete execution fragment $\alpha'_1$ starting from
state $s_0$, token $t_1$ traverses at least one switch of B. Suppose not. Denote by $s'_1$
the final state of $\alpha'_1$. Because A is an adding network, $t_1$ takes the value v in $\alpha'_1$.

Consider now the t-solo, t-complete execution fragment $\alpha''_1$ starting from
state $s'_1$. Since $t_1$ does not traverse any switch of B, all switches traversed by t
in $\alpha''_1$ have the same state in $s_0$ and $s'_1$. Therefore, token t takes the same value
in $\alpha''_1$ as in the t-solo, t-complete execution fragment starting from $s_0$. It follows
that t takes the value v.

We have constructed an execution in which both tokens t and $t_1$ take the
value v. Since $|a|, |b| > 0$, this contradicts the adding property of A.

It follows that $t_1$ traverses at least one switch of B in $\alpha'_1$. Let $\alpha_1$ be the
shortest prefix of $\alpha'_1$ such that $t_1$ is in front of a switch $b_1 \in B$ at the final state
$s_1$ of $\alpha_1$.

Induction Hypothesis

Assume inductively that for some i, $1 < i \le n-1$, the claim holds for all
j, $1 \le j < i$; that is, there exists an execution fragment $\alpha_j$ with final state $s_j$
starting from state $s_{j-1}$ such that token $t_j$ is in front of a switch $b_j \in B$ at state
$s_j$.

Induction Step

For the induction step, we prove that in the $t_i$-solo, $t_i$-complete execution fragment
$\alpha'_i$ starting from state $s_{i-1}$, token $t_i$ traverses at least one switch of B.
Suppose not. Denote by $s'_i$ the final state of $\alpha'_i$. Since t has taken no step in
execution $\alpha_1 \cdot \ldots \cdot \alpha_{i-1}$, the adding property of A implies that token $t_i$ takes
value $v_i \equiv v \pmod{y}$. Consider now the t-solo, t-complete execution fragment
$\alpha''_i$ starting from state $s'_i$. By construction of the execution $\alpha_1 \cdot \ldots \cdot \alpha_{i-1}$, tokens
$t_1, \ldots, t_{i-1}$ do not traverse any switch of B in $\alpha_1 \cdot \ldots \cdot \alpha_{i-1}$. Therefore, all
switches traversed by t in $\alpha''_i$ have the same state at $s_0$ and $s'_i$. Thus, token t
takes the value v in both $\alpha''_i$ and in the t-solo, t-complete execution fragment
starting from $s_0$.

Because A is an adding network, if t takes the value v, then $t_i$ must take value
$v_i \equiv v + x \pmod{y}$, but we have just constructed an execution where t takes
value v, and $t_i$ takes value $v_i \equiv v \pmod{y}$, which is a contradiction because
$x \not\equiv 0 \pmod{y}$. Thus, token $t_i$ traverses at least one switch of B in $\alpha'_i$. Let $\alpha_i$
be the shortest prefix of $\alpha'_i$ such that $t_i$ is in front of a switch $b_i \in B$ at the final
state $s_i$ of $\alpha_i$, to complete the proof of the induction step.

At this point the proof of Lemma 1 is complete.

Let $\alpha = \alpha_1 \cdot \ldots \cdot \alpha_{n-1}$. Clearly, only n tokens, one per process, are involved
in $\alpha$ and they are uniformly distributed on the input wires, so $\alpha$ is a one-shot
execution. By Lemma 1, tokens $t_i$, $1 \le i \le n-1$, are in front of switches
of B at the final state of $\alpha$. Notice also that all switches in B are in the same
state at states $s_0$ and $s_{n-1}$. Thus, in the t-solo, t-complete execution fragment
starting from state $s_{n-1}$ token t traverses all switches of B. Because A has one-shot
contention c, no more than c − 1 other tokens can be in front of any switch
of B in $\alpha$. Thus, B must contain at least $\lceil (n-1)/(c-1) \rceil$ switches.

Since any S-adding network, where $|S| > 2$, is an S'-adding network for all
$S' \subset S$, Theorem 1 implies that in every sequential execution of the S-adding
network all tokens (except possibly those with maximum weight) traverse $\Omega(n/c)$
switches.

We remark that the one-shot contention c of a large class of switching networks,
including conventional counting networks, is constant. For example, consider
the class of switching networks with $\Omega(n)$ input wires whose switches produce
any permutation of their input tokens on their output wires. A straightforward
induction argument shows that each switching network of this class has
the 1-balancing property, and thus in a one-shot execution it never has more
Fig. 1. The Ladder switching network of layer depth 4.



than one token on each wire. It follows that the one-shot contention of such a
network is bounded by the maximum fan-in of any of its switches. By Theorem 1,
all adding networks in this class have $\Omega(n)$ latency. Conventional counting networks
(and the Ladder adding network introduced in Section 4) belong to this
class.

4 Upper Bound 

In this section, we show that the lower bound of Section 3 is essentially tight.
We present a low-contention adding network, called Ladder, such that in any of
its sequential executions tokens traverse O(n) switches, while in its concurrent
executions they traverse an average of O(n) switches. The switching network
described here has the same topology as the Skew counting network, though
its behavior is substantially different. A Ladder layer is an unbounded-depth
switching network consisting of a sequence of binary switches $b_i$, $i \ge 0$, that is,
switches with $f_{in} = f_{out} = 2$. For switch $b_0$, both input wires are input wires
to the layer, while for each switch $b_i$, $i > 0$, the first (or north) input wire is an
output wire of switch $b_{i-1}$, while the second (or south) input wire is an input
wire of the layer. The north output wire of any switch $b_i$, $i \ge 0$, is an output
wire of the layer, while the south output wire of $b_i$ is the north input wire of
switch $b_{i+1}$.

A Ladder switching network of layer depth d is a switching network
constructed by layering d Ladder layers so that the i-th output wire of each
layer is the i-th input wire of the next. Clearly, the Ladder switching network
has an infinite number of input and output wires. The Ladder adding network
consists of a counting network followed by a Ladder switching network of layer
depth n.
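
The wiring just described can be written down as a small successor function.
The following Python fragment is only a sketch under our own conventions, none
of which come from the paper: layers and switches are numbered from 0, a switch
is addressed as (layer, index), and the helper names layer_entry,
south_successor and north_successor are hypothetical.

```python
NORTH, SOUTH = "north", "south"


def layer_entry(wire):
    """Switch index and input side reached from input wire `wire` of a layer.

    Wire 0 is the north input of b_0; wire i > 0 is the south input of b_{i-1}.
    """
    return (0, NORTH) if wire == 0 else (wire - 1, SOUTH)


def south_successor(layer, i):
    """The south output of b_i is the north input of b_{i+1} in the same layer."""
    return (layer, i + 1, NORTH)


def north_successor(layer, i, depth):
    """The north output of b_i is output wire i of its layer.

    That wire is input wire i of the next layer, or an output wire of the
    whole network (returned as None) when `layer` is the last of `depth` layers.
    """
    if layer == depth - 1:
        return None
    j, side = layer_entry(i)
    return (layer + 1, j, side)
```

For instance, in a network of layer depth 4, the north output of switch 3 in
the first layer leads to the south input of switch 2 in the second layer.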

Figure 1 illustrates a Ladder switching network of layer depth 4. Switches
are represented by fat vertical lines, and wires by horizontal arrows. Wires
$X_0, X_1, \ldots$ are the input wires of the network, while $Y_0, Y_1, \ldots$
are its output wires. All switches for which one of their input wires is an
input wire of the network belong to the first Ladder layer. All dashed wires
belong to row 1.

Each process moves its token through the counting network first, and uses
the result to choose an input wire to the Ladder network. The counting network
ensures that each input wire is chosen by exactly one token, and each switch is
visited by two tokens. A fresh switch is one that has never been visited by a token.
Each switch s has the following state: a bit s.toggle that assumes values north
and south, initially north, and an integer value s.weight, initially 0. The fields
s.north and s.south are pointers to the (immutable) switches that are connected
to s through its north and south output wires, respectively.

Each token t has the following state: the t.arg field is the original weight of
the token. The t.weight field is originally 0, and it accumulates the sum of the
weights of tokens ordered before t. The t.wire field records whether the token
will enter the next switch on its north or south input wire.

Within Ladder, a token proceeds in two epochs, a north epoch followed by
a south epoch. Tokens behave differently in different epochs. A token's north
epoch starts when the token enters Ladder, continues as long as it traverses fresh
switches, and ends as soon as it traverses a non-fresh switch. When a north-epoch
token t visits a fresh switch s, the following occurs atomically: (1) s.toggle
flips from north to south, and (2) s.weight is set to t.weight + t.arg. Then, t
exits on the switch's north wire.

The first time a token visits a non-fresh switch, it adds that switch’s weight 
to its own, exits on the south wire, and enters its south epoch. Once a token 
enters its south epoch, it never moves “up” to a lower-numbered row. When 
a south-epoch token enters a switch on its south wire, it simply exits on the 
same wire (and same row), independently of the switch’s current state. When 
a south-epoch token enters a switch on its north wire, it does the following. If 
the switch is fresh, then, as before, it atomically sets the switch’s weight to the 
sum of its weight and argument, flips the toggle bit, and exits on the north wire 
(same row). If the switch is not fresh, it adds the switch’s weight to its own, 
and exits on the south wire (one row “down”). When the token emerges from 
Ladder, its current weight is its output value. 
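
The rules above can be exercised on a sequential execution. The following
Python sketch is not the authors' code: it assumes that layers and switches are
numbered from 0, that the counting network hands the k-th token input wire k,
and that tokens run strictly one after another. The names Token, layer_entry,
traverse and run_sequential, and the hops counter, are hypothetical and exist
only for illustration.

```python
from dataclasses import dataclass

NORTH, SOUTH = "north", "south"


@dataclass
class Token:
    arg: int          # original weight of the token (t.arg)
    weight: int = 0   # sum of the weights ordered before it (t.weight)
    hops: int = 0     # switches traversed; bookkeeping for this sketch only


def layer_entry(wire):
    """Switch index and input side reached from input wire `wire` of a layer."""
    return (0, NORTH) if wire == 0 else (wire - 1, SOUTH)


def traverse(token, in_wire, depth, weights):
    """Route one token through `depth` Ladder layers; return its output wire.

    `weights` maps (layer, index) of every non-fresh switch to its weight;
    a switch that is absent from the map is fresh.
    """
    layer = 0
    idx, side = layer_entry(in_wire)
    north_epoch = True
    while True:
        token.hops += 1
        key = (layer, idx)
        if not north_epoch and side == SOUTH:
            exit_north = False            # south-epoch token ignores the switch
        elif key not in weights:          # fresh: record weight, flip toggle
            weights[key] = token.weight + token.arg
            exit_north = True
        else:                             # non-fresh: absorb weight, move down
            token.weight += weights[key]
            north_epoch = False
            exit_north = False
        if not exit_north:
            idx, side = idx + 1, NORTH    # south output feeds the next north input
        elif layer == depth - 1:
            return idx                    # north output of the last layer
        else:
            layer += 1
            idx, side = layer_entry(idx)  # output wire i is input wire i below


def run_sequential(args, depth):
    """Send tokens one at a time; the k-th token enters on input wire k."""
    weights = {}
    results = []
    for wire, arg in enumerate(args):
        t = Token(arg)
        out = traverse(t, wire, depth, weights)
        results.append((t.weight, out, t.hops))
    return results
```

With token weights 5, 3, 7, 2 and layer depth 4, run_sequential returns the
output values 0, 5, 8, 15, that is, each token receives the sum of the weights
of the tokens before it. In this sketch the first token crosses 4 switches and
every later token crosses 8, one or two per layer.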

All tokens other than the one that exits on the first output wire eventually
reach a non-fresh switch. When a token t encounters its first non-fresh switch,
that switch's weight is the sum of the weights of all the tokens that precede t
(so far) in the adding order. Each time the token enters a non-fresh switch on
its north wire, it has been "overtaken" by an earlier token, so it moves down
one row and adds this other token's weight to its own. Figure 2 shows
pseudo-code for the two epochs. For ease of presentation, the pseudo-code shows
the switch complementing its toggle field and updating its weight field in one
atomic operation. However, a slightly more complicated construction can realize
this state change as a simple atomic complement operation on the toggle bit.

Even though Ladder has an unbounded number of switches, it can be im- 
plemented by a finite network by “folding” the network so that each folded 
switch simulates an unbounded number of primitive switches. A similar folding 
construction appears in |S|. 

Ladder is lock-free, but not wait-free. It is possible for a slow token to 
remain in the network forever if it is overtaken by infinitely many faster tokens. 

Proving that Ladder is an adding network is a major challenge of our analysis.
We point out that although Ladder has the same topology as Skew, Ladder is
substantially more powerful than Skew; we require a substantially different
and more complicated analysis to prove its adding property.

void north_traverse(token t, switch s) {
    if (s.toggle == NORTH) {              /* fresh */
        atomically {
            s.toggle = SOUTH;
            s.weight = t.weight + t.arg;
        }
        north_traverse(t, s.north);
    } else {                              /* not-so-fresh */
        t.weight += s.weight;
        t.wire = NORTH;
        south_traverse(t, s.south);
    }
}

void south_traverse(token t, switch s) {
    if (t.wire == SOUTH) {                /* ignore switch */
        t.wire = NORTH;                   /* toggle wire */
        south_traverse(t, s.south);
    } else {
        t.wire = SOUTH;                   /* toggle wire */
        if (s.toggle == NORTH) {          /* fresh */
            atomically {
                s.toggle = SOUTH;
                s.weight = t.weight + t.arg;
            }
            south_traverse(t, s.north);
        } else {                          /* overtaken */
            t.weight += s.weight;
            t.wire = NORTH;               /* s.south feeds a north input wire */
            south_traverse(t, s.south);
        }
    }
}

Fig. 2. Pseudo-Code for Ladder Traversal.

Theorem 2. Ladder is an adding network. 

The performance analysis of Ladder uses arguments similar to those used for
Skew. This follows naturally from the fact that the two networks have the
same topology and both maintain the 1-balancing property. (Notice that
knowing just the topology of a switching network is not always enough to
analyze its performance, because the way each token moves in the network may
depend on both the state of the token and the state of any switch it traverses.)

It can be proved that Ladder can itself be used to play the role of the
conventional counting network. From now on, we assume that this is the case, that
is, the Ladder adding network consists only of the Ladder switching network
(which it also uses as a traditional counting network).

Theorem 3. (a) In any execution of Ladder, each token traverses an average
number of 2n switches;
(b) In any sequential execution of Ladder, each token traverses exactly 2n
switches;
(c) The contention of Ladder is 2.

An execution of a switching network is linearizable if for any two tokens
$t$ and $t'$ such that $t'$ entered the network after $t$ has exited it, it
holds that $v_t < v_{t'}$, where $v_t, v_{t'}$ are the output values of $t$ and
$t'$, respectively. For any adding network there exist executions which are
linearizable (e.g., executions in which all tokens have different weights which
are powers of two). For Ladder it holds that any of its executions is
linearizable. It has been proved [6, Theorem 5.1, Section 5] that any
non-blocking linearizable counting network other than the trivial network with
only one balancer has an infinite number of input wires; that is, if all the
executions of the network are linearizable, then the network has infinite
width. Although an adding network implements a fetch&increment operation (and
thus it can serve as a counting network), this lower bound does not apply to
adding networks because its proof uses the fact that counting networks consist
only of balancers and counter objects, which is not generally the case for
switching networks.
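
The linearizability condition can be checked mechanically on a recorded
execution. The Python sketch below is illustrative only; the record type Run,
its fields enter, exit and value, and the function is_linearizable are
hypothetical names, and the timestamps are assumed to come from some global
clock that the paper does not describe.

```python
from typing import NamedTuple


class Run(NamedTuple):
    enter: float   # time at which the token entered the network
    exit: float    # time at which the token left the network
    value: int     # output value v_t obtained by the token


def is_linearizable(runs):
    """True if v_t < v_t' whenever t' entered after t had exited."""
    return all(
        a.value < b.value
        for a in runs
        for b in runs
        if b.enter > a.exit
    )
```

Tokens whose traversals overlap in time are unconstrained by the definition, so
the check only compares pairs in which one token finished before the other
started.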

5 Discussion 

We close with some straightforward generalizations of our results. Consider a
family $\Phi$ of functions from values to values. Let $\phi$ be an element of
$\Phi$ and $x$ a variable. The read-modify-write operation [8],
$RMW(x, \phi)$, atomically replaces the value of $x$ with $\phi(x)$, and
returns the prior value of $x$. Most common synchronization primitives, such as
fetch&add, swap, test&set, and compare&swap, can be cast as read-modify-write
operations for suitable choices of $\phi$. A read-modify-write network is one
that supports read-modify-write operations. The Ladder network is easily
extended to a read-modify-write network for any family of commutative functions
(for all functions $\phi, \psi \in \Phi$ and all values $v$, $\phi$ and $\psi$
are commutative if and only if $\phi(\psi(v)) = \psi(\phi(v))$). A map $\phi$
can discern another map $\psi$ if $\phi^k(\psi(x)) \neq \phi^\ell(x)$ for some
value $x$ and all natural numbers $k$ and $\ell$. Informally, one can always
tell whether $\psi$ has been applied to a variable, even after repeated
successive applications of $\phi$. For example, if $\phi$ is addition by $a$ and
$\psi$ addition by $b$, where $|a| > |b| > 0$, then $\phi$ can discern $\psi$.
Our lower bound can be generalized to show that if a switching network supports
read-modify-write operations for two functions one of which can discern the
other, then in any n-process sequential execution, processes traverse
$\Omega(n/c)$ switches before choosing a value.
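
The definitions in this section can be made concrete with a few lines of
Python. This is a sketch under our own assumptions rather than part of the
paper: the class Register and the helpers fetch_and_add, swap,
compare_and_swap, commute and discerns are hypothetical names, atomicity is
modelled with a lock, and discerns only tests exponents below a finite bound,
so a True result is evidence and not a proof.

```python
import threading


class Register:
    """A shared variable x supporting RMW(x, phi)."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def rmw(self, phi):
        """Atomically replace the value with phi(value); return the prior value."""
        with self._lock:
            old = self.value
            self.value = phi(old)
            return old


def fetch_and_add(a):
    return lambda x: x + a


def swap(v):
    return lambda x: v


def compare_and_swap(expected, new):
    return lambda x: new if x == expected else x


def commute(phi, psi, values):
    """Check phi(psi(v)) == psi(phi(v)) on the given sample of values."""
    return all(phi(psi(v)) == psi(phi(v)) for v in values)


def power(phi, k, x):
    """Apply phi to x exactly k times."""
    for _ in range(k):
        x = phi(x)
    return x


def discerns(phi, psi, x, bound):
    """Bounded test: phi^k(psi(x)) != phi^l(x) for all 0 <= k, l < bound."""
    left = {power(phi, k, psi(x)) for k in range(bound)}
    right = {power(phi, l, x) for l in range(bound)}
    return left.isdisjoint(right)
```

Under these definitions, addition by 5 passes the bounded discern test against
addition by 3, whereas addition by 3 does not against addition by 6, since
applying the former twice reproduces the latter. Two fetch&add maps commute,
while fetch&add and swap do not.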

Perhaps the most important remaining open question is whether there exist
low-contention wait-free adding networks (notice that since switching networks
do not contain cycles, any network with a finite number of switches would be
wait-free).



Acknowledgement. We would like to thank Faith Fich for many useful comments
that improved the presentation of the paper. Our thanks go to the anonymous
DISC'01 reviewers for their feedback.

References 

1. Aharonson, E., Attiya, H.: Counting networks with arbitrary fan-out. Distributed 
Computing, 8 (1995) 163-169. 

2. Aiello, W., Busch, C., Herlihy, M., Mavronicolas, M., Shavit, N., Touitou, D.: Sup- 
porting Increment and Decrement Operations in Balancing Networks. Proceedings 
of the 16th International Symposium on Theoretical Aspects of Computer Science, 
pp. 393-403, Trier, Germany, March 1999. 

3. Aspnes, J., Herlihy, M., Shavit, N.: Counting Networks. Journal of the ACM, 41 
(1994) 1020-1048. 

4. Busch, C., Mavronicolas, M.: An Efficient Counting Network. Proceedings of the 
1st Merged International Parallel Processing Symposium and IEEE Symposium on 
Parallel and Distributed Processing, pp. 380-385, Orlando, Florida, May 1998. 

5. Goodman, J., Vernon, M., Woest, P.: Efficient synchronization primitives for large- 
scale cache-coherent multiprocessors. Proceedings of the 3rd International Confer- 
ence on Architectural Support for Programming Languages and Operating Systems, 
pp. 64-75, Boston, Massachusetts, April 1989. 

6. Herlihy, M., Shavit, N., Waarts, O.: Linearizable Counting Networks. Distributed 
Computing, 9 (1996) 193-203. 

7. Klugerman, M., Plaxton, C.: Small-Depth Counting Networks. Proceedings of the
24th Annual ACM Symposium on Theory of Computing, pp. 417-428, May 1992.

8. Kruskal, C., Rudolph, L., Snir, M.: Efficient Synchronization on Multiprocessors 
with Shared Memory. Proceedings of the 5th Annual ACM Symposium on Princi- 
ples of Distributed Computing, pp. 218-228, Calgary, Canada, August 1986. 

9. Mavronicolas, M., Merritt, M., Taubenfeld, G.: Sequentially Consistent versus Lin- 
earizable Counting Networks. Proceedings of the 18th Annual ACM Symposium on 
Principles of Distributed Computing, pp. 133-142, May 1999. 

10. Moran, S., Taubenfeld, G.: A Lower Bound on Wait-Free Counting. Journal of 
Algorithms, 24 (1997) 1-19. 

11. Shavit, N., Touitou, D.: Elimination trees and the Construction of Pools and 
Stacks. Theory of Computing Systems, 30 (1997) 645-670. 

12. Wattenhofer, R., Widmayer, P.: An Inherent Bottleneck in Distributed Counting. 
Journal of Parallel and Distributed Computing, 49 (1998) 135-145. 




Author Index 



Aguilera, M.K. 108 
Alonso, G. 93 
Anderson, J.H. 1 
Arevalo, S. 93 

Barriere, L. 270 
Boldi, P. 33

Chatzigiannakis, I. 285 

Delporte-Gallet, C. 108 
Dobrev, S. 166 
Douceur, J.R. 48 
Duflot, M. 240 

Fatourou, P. 330 
Fauconnier, H. 108 
Fich, F.E. 224 
Flocchini, P. 166 
Fraigniaud, P. 270 
Fribourg, L. 240 
Fujiwara, H. 123 

Garg, V.K. 78 
Georgiou, C. 151 

Harris, T.L. 300 
Herlihy, M. 136, 209, 330 
Herman, T. 315 
Higham, L. 194 
Hoepman, J.-H. 180 

Inoue, M. 123 

Jimenez-Peris, R. 93 
Johnen, C. 224 
Joung, Y.-J. 16 



Kim, Y.-J. 1 

Kranakis, E. 270 
Krizanc, D. 270 

Liang, Z. 194 

Malkhi, D. 63 
Masuzawa, T. 123, 315 
Mittal, N. 78 

Nikoletseas, S. 285 

Patiño-Martínez, M. 93
Pavlov, E. 63 
Peleg, D. 255 
Picaronny, C. 240 
Pincas, U. 255 
Prencipe, G. 166 

Rajsbaum, S. 136 
Russell, A. 151 

Santoro, N. 166 
Sella, Y. 63 
Shvartsman, A. A. 151 
Spirakis, P. 285 

Tirthapura, S. 209 
Toueg, S. 108 
Tuttle, M. 136 

Umetani, S. 123 

Vigna, S. 33 

Wattenhofer, R.P. 48 




